About This Manual


Preface 


Read This First 


This manual is a reference for programming TMS320C6x digital signal proces- 
sor (DSP) devices. 


Before you use this book, you should install your code generation and debug- 
ging tools. 


This book is organized in four major parts: 


Part I: Introduction includes a brief description of the ’C6x architecture
and code development flow. It also includes a tutorial that introduces you
to the tools you will use in each phase of development and an optimization
checklist to help you achieve optimal performance from your code.


Part II: C Code includes C code examples and discusses optimization
methods for the code. This information can help you choose the most
appropriate optimization techniques for your code.


Part III: Assembly Code describes the structure of assembly code. It
provides examples and discusses optimizations for assembly code. It also
includes a chapter on interrupt subroutines.


Part IV: Appendix provides extensive code examples from the GSM EFR
vocoder.
Related Documentation From Texas Instruments


The following books describe the TMS320C6x devices and related support 
tools. To obtain a copy of any of these TI documents, call the Texas Instru- 
ments Literature Response Center at (800) 477-8924. When ordering, please 
identify the book by its title and literature number. 


TMS320C6x Assembly Language Tools User’s Guide (literature number 
SPRU186) describes the assembly language tools (assembler, linker, 
and other tools used to develop assembly language code), assembler 
directives, macros, common object file format, and symbolic debugging 
directives for the ’C6x generation of devices. 


TMS320C6x Optimizing C Compiler User’s Guide (literature number 
SPRU187) describes the ’C6x C compiler and the assembly optimizer. 
This C compiler accepts ANSI standard C source code and produces as- 
sembly language source code for the ‘C6x generation of devices. The as- 
sembly optimizer helps you optimize your assembly code. 


TMS320C6x C Source Debugger User’s Guide (literature number 
SPRU188) tells you how to invoke the ’C6x simulator and emulator 
versions of the C source debugger interface. This book discusses 
various aspects of the debugger, including command entry, code 
execution, data management, breakpoints, profiling, and analysis. 


TMS320C62x/C67x CPU and Instruction Set Reference Guide (literature 
number SPRU189) describes the ’C62x/C67x CPU architecture, instruc- 
tion set, pipeline, and interrupts for these digital signal processors. 


TMS320 DSP Designer’s Notebook: Volume 1 (literature number 
SPRT125) presents solutions to common design problems using ’C2x, 
’C8x, ’C4x, 'C5x, and other TI DSPs. 


TMS320C6201/C6701 Peripherals Reference Guide (literature number 
SPRU190) describes common peripherals available on the 
TMS320C6201/C6701 digital signal processors. This book includes in- 
formation on the internal data and program memories, the external 
memory interface (EMIF), the host port, serial ports, direct memory 
access (DMA), clocking and phase-locked loop (PLL), and the power- 
down modes. 


TMS320C6201 Digital Signal Processor Data Sheet (literature number 
SPRS051) describes the features of the TMS320C6201 and provides 
pinouts, electrical specifications, and timings for the device. 


Trademarks 


Solaris and SunOS are trademarks of Sun Microsystems, Inc. 
VelociTI is a trademark of Texas Instruments Incorporated.


Windows and Windows NT are registered trademarks of Microsoft 
Corporation. 
If You Need Assistance


— World-Wide Web Sites 


TI Online                                        http://www.ti.com
Semiconductor Product Information Center (PIC)   http://www.ti.com/sc/docs/pic/home.htm
DSP Solutions                                    http://www.ti.com/dsps
320 Hotline On-line™                             http://www.ti.com/sc/docs/dsps/support.htm


North America, South America, Central America

Product Information Center (PIC)           (972) 644-5580
TI Literature Response Center U.S.A.       (800) 477-8924
Software Registration/Upgrades             (214) 638-0333   Fax: (214) 638-7742
U.S.A. Factory Repair/Hardware Upgrades    (281) 274-2285
U.S. Technical Training Organization       (972) 644-5580
DSP Hotline                                (281) 274-2320   Fax: (281) 274-2324   Email: dsph@ti.com
DSP Modem BBS                              (281) 274-2323
DSP Internet BBS via anonymous ftp to ftp://ftp.ti.com/pub/tms320bbs


Europe, Middle East, Africa 
European Product Information Center (EPIC) Hotlines: 
Multi-Language Support               +33 1 30 70 11 69   Fax: +33 1 30 70 10 32
Email: epic@ti.com
Deutsch                              +49 8161 80 33 11 or +33 1 30 70 11 68
English                              +33 1 30 70 11 65
Francais                             +33 1 30 70 11 64
Italiano                             +33 1 30 70 11 67
EPIC Modem BBS                       +33 1 30 70 11 99
European Factory Repair              +33 4 93 22 25 40
Europe Customer Training Helpline    Fax: +49 81 61 80 40 10


Asia-Pacific 

Literature Response Center    +852 2 956 7288   Fax: +852 2 956 2200
Hong Kong DSP Hotline         +852 2 956 7268   Fax: +852 2 956 1002
Korea DSP Hotline             +82 2 551 2804    Fax: +82 2 551 2828
Korea DSP Modem BBS           +82 2 551 2914
Singapore DSP Hotline                           Fax: +65 390 7179
Taiwan DSP Hotline            +886 2 377 1450   Fax: +886 2 377 2718
Taiwan DSP Modem BBS          +886 2 376 2592
Taiwan DSP Internet BBS via anonymous ftp to ftp://dsp.ee.tit.edu.tw/pub/TI/


Japan 

Product Information Center +0120-81-0026 (in Japan) Fax: +0120-81-0036 (in Japan) 
+03-3457-0972 or (INTL) 813-3457-0972 Fax: +03-3457-1259 or (INTL) 813-3457-1259 

DSP Hotline +03-3769-8735 or (INTL) 813-3769-8735 Fax: +03-3457-7071 or (INTL) 813-3457-7071 

DSP BBS via Nifty-Serve Type “Go TIASP” 


Documentation 


When making suggestions or reporting errors in documentation, please include the following information that is on the title 
page: the full title of the book, the publication date, and the literature number. 
Mail: Texas Instruments Incorporated Email: dsph@ti.com 
Technical Documentation Services, MS 702 
P.O. Box 1443 
Houston, Texas 77251-1443 


Note: When calling a Literature Response Center to order documentation, please specify the literature number of the 
book. 
Chapter 1


Introduction 


Part I 


This chapter introduces some features of the ‘C6x microprocessor and 
discusses the basic process for creating code. Any reference to ’C6x pertains 
to both the ’C62x (fixed-point) and the 'C67x (floating-point) devices. All tech- 
niques are applicable to both devices, even though most of the examples 
shown are fixed-point specific. 


Topic                                                        Page

1.1  TMS320C6x Architecture ................................. 1-2
1.2  TMS320C6x Pipeline ..................................... 1-2
1.3  Code Development Flow to Increase Performance .......... 1-3



1.1 TMS320C6x Architecture 



The 'C62x is a fixed-point digital signal processor (DSP) and is the first DSP 
to use the VelociTI™ architecture. VelociTI is a high-performance, advanced 
very-long-instruction-word (VLIW) architecture, making it an excellent choice 
for multichannel, multifunction, and performance-driven applications. 


The ’C67x is a floating-point DSP with the same features. It is the second DSP
to use the VelociTI™ architecture.


The ’C6x DSPs are based on the ’C6x CPU, which consists of: 


Program fetch unit 

Instruction dispatch unit 

Instruction decode unit 

Two data paths, each with four functional units 
Thirty-two 32-bit registers 

Control registers 

Control logic 

Test, emulation, and interrupt logic 


1.2 TMS320C6x Pipeline 


The ’C6x pipeline has several features that provide optimum performance, low 
cost, and simple programming. 


Increased pipelining eliminates traditional architectural bottlenecks in pro- 
gram fetch, data access, and multiply operations. 


Pipeline control is simplified by eliminating pipeline locks. 
The pipeline can dispatch eight parallel instructions every cycle. 


Parallel instructions proceed simultaneously through the same pipeline 
phases. 


1.3 Code Development Flow to Increase Performance 


You can achieve the best performance from your ’C6x code if you follow this 
flow when you are writing and debugging your code: 


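[Flowchart: the three phases of the code development flow. Phase 1, Develop
C Code: write C code, then check whether the work is complete. Phase 2,
Refine C Code: refine the C code and repeat while more C optimization is
possible. Phase 3, Write Linear Assembly: write linear assembly, run the
assembly optimizer, and repeat until complete.]
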


The following lists the phases in the 3-step software development flow shown 
on page 1-3, and the goal for each phase: 


Phase Goal 


1 You can develop your C code for phase 1 without any knowledge of 
the ’C6x. Use the ’C6x profiling tools that are described in the 
TMS320C6x C Source Debugger User’s Guide to identify any ineffi- 
cient areas that you might have in your C code. To improve the per- 
formance of your code, proceed to phase 2. 
2 Use the intrinsics, shell options, and techniques that are described
in Chapter 4 of this book to improve your C code. Use the ’C6x profil- 
ing tools to check its performance. If your code is still not as efficient 
as you would like it to be, proceed to phase 3. 


3 Extract the time-critical areas from your C code and rewrite the code 
in linear assembly. You can use the assembly optimizer to optimize 
this code. 


Chapter 2 


Code Development Flow Tutorial 


This chapter walks you through the code development flow that was
introduced in Chapter 1. It uses step-by-step instructions and code examples
to show you how to use the software development tools in each phase of
development.


Before you start this tutorial, you should install the code generation tools and 
the C source debugger. If you do not have a Texas Instruments C source de- 
bugger, use your own debugger to check your results. 


The sample code that is used in this tutorial is included on the code generation 
tools CD-ROM. When you install your code generation tools, the example 
code is installed in the c6xtools directory. Use the code in that directory to go 
through the examples in this chapter. 


The examples in this chapter were run on the most recent version of the soft- 
ware development tools that were available as of the publication of this book. 
Because the tools are being continuously improved, you may get different re- 
sults if you are using a more recent version of the tools. 
2.1 Before You Begin 


This tutorial contains three basic types of information:


Primary tasks

Primary tasks identify the main lessons in the tutorial; they are boxed so
that you can find them easily. A primary task looks like this:

    On a command line, enter:

    load6x count.out


Important information

In addition to primary actions, important information ensures that the
tutorial works correctly. Important information is marked like this:

    Important! If you are using SunOS, be sure you reinitialize your shell
    before continuing with this tutorial.


Optional tasks

Optional tasks allow you to learn more about the ’C6x tools; however, you
do not need to perform the optional tasks to complete the tutorial
successfully. Optional tasks are marked like this:

    Try This: The stand-alone simulator (load6x) is another tool that you
    can use to find out what the cycle count for each function is.

This tutorial is divided into lessons. Each lesson builds on the previous lesson. 
To get the most benefit from the tutorial, you should start at the beginning and 
work your way through each lesson in order to the end. 
2.2 Introduction to the Example Code


The C code example that you will use to start this tutorial is demo1.c, which
is shown in Example 2-1. This example calls three functions: mac1( ),
vec_mpy1( ), and iir1( ).


Example 2-1. The Code Example—demo1.c 


main(int argc, char *argv[])
{
    const short coefs[150];
    short optr[150];
    short state[2];
    const short a[150];
    const short b[150];
    int c = 0;
    int dotp[1] = {0};
    int sum = 0;
    short y[150];
    short scalar = 3345;
    const short x[150];

    sum = mac1(a, b, c, dotp);
    vec_mpy1(y, x, scalar);
    iir1(coefs, x, optr, state);
}


The mac1( ) function, a multiply accumulate and squaring accumulate example,
is shown in Example 2-2. It computes the dot product of vector a with
vector b and also sums the squares of the elements of vector b.


Example 2-2. The Multiply Accumulate Function—mac1.c 


int mac1(const short *a, const short *b, int sqr, int *sum)
{
    int i;
    int dotp = *sum;

    for (i = 0; i < 150; i++)
    {
        dotp += b[i] * a[i];
        sqr += b[i] * b[i];
    }

    *sum = dotp;
    return sqr;
}
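
Before moving to the ’C6x tools, you may find it useful to check the
arithmetic of mac1( ) on a host compiler. The following is a minimal sketch,
not part of the tutorial's sample code: mac_n( ) is a hypothetical variant
that takes the vector length as a parameter n, where the tutorial's version
hard-codes 150 elements.

```c
#include <assert.h>

/* Hypothetical variant of mac1() with the length passed in as n.
 * Adds the dot product of a and b to *sum and returns sqr plus the
 * sum of the squares of b, as the tutorial's 150-element loop does. */
int mac_n(const short *a, const short *b, int n, int sqr, int *sum)
{
    int i;
    int dotp = *sum;

    assert(n >= 0);
    for (i = 0; i < n; i++) {
        dotp += b[i] * a[i];   /* multiply accumulate */
        sqr  += b[i] * b[i];   /* squaring accumulate */
    }

    *sum = dotp;
    return sqr;
}
```

For a = {1, 2, 3} and b = {4, 5, 6}, with both accumulators starting at 0,
the dot product is 32 and the sum of squares is 77.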


The vec_mpy1( ) function shown in Example 2-3 is a vector multiply, which is
a scalar multiply followed by a right shift. The result is stored to a second
vector.


Example 2-3. The Vector Multiply Function—vec_mpy1.c 


void vec_mpy1(short y[], const short x[], short scalar)
{
    int i;

    for (i = 0; i < 150; i++)
        y[i] += ((scalar * x[i]) >> 15);
}
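
The right shift by 15 is what you would expect if scalar is treated as a
Q15 fraction, that is, as scalar / 32768. This is an assumption; the tutorial
does not name the number format. Under that reading, the value 3345 used in
demo1.c represents about 0.102. The sketch below uses a hypothetical
variable-length variant, vec_mpy_n( ), to show the effect on a host compiler.

```c
#include <assert.h>

/* Hypothetical variable-length variant of vec_mpy1().
 * Assumes scalar is a Q15 fraction (scalar / 32768): the 16 x 16
 * product is formed in 32 bits and the 15 fraction bits are dropped.
 * Note: >> on a negative product is implementation-defined in ISO C,
 * so this sketch is only checked here with non-negative data. */
void vec_mpy_n(short y[], const short x[], int n, short scalar)
{
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        y[i] += (short)(((int)scalar * x[i]) >> 15);
}
```

With scalar = 3345, an input of 1000 contributes 102 and an input of 2000
contributes 204, so y = {1, 2} becomes {103, 206}.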


The third function, iir1( ), is a typical infinite impulse response (IIR)
biquad filter. The code for this function is shown in Example 2-4.


Example 2-4. The Biquad Filter—iir1.c 


void iir1(const short *coefs, const short *input,
          short *optr, short *state)
{
    short x;
    short t;
    int n;

    x = input[0];

    for (n = 0; n < 50; n++)
    {
        t = x + ((coefs[2] * state[0] +
                  coefs[3] * state[1]) >> 15);

        x = t + ((coefs[0] * state[0] +
                  coefs[1] * state[1]) >> 15);

        state[1] = state[0];
        state[0] = t;

        coefs += 4;    /* point to next filter coefs */
        state += 2;    /* point to next filter states */
    }
}
2.3 Lesson 1: Compiling, Assembling, and Linking the Example Code


The first step is to compile, assemble, and link the code. 


Compiling for the ’C62x: 
On a command line, enter the following on a single line: 


cl6x -g -o -k -mg demo1.c mac1.c vec_mpy1.c iir1.c
     -z lnk.cmd -l rts6201.lib -o demo1.out


Compiling for the ’C67x: 
On a command line, enter the following on a single line: 


cl6x -g -o -k -mg -mv6700 demo1.c mac1.c vec_mpy1.c
     iir1.c -z lnk.cmd -l rts6701.lib -o demo1.out


You should not receive any errors, and the file, demo1.out, should be created. 
If you receive an error message, look up that error message in the appropriate 
user’s guide. 


Here is a description of what you told the shell program (cl6x) to do: 


cl6x       Run the compiler and the assembler.

-g         Generate symbolic debugging directives that are used by
           the debugger.

-o         Invoke the optimizer at the default level (-o is the same as
           -o2).

           Not all optimizations work well with debugging because the
           optimizer's rearrangement of code can make it difficult for
           you to correlate source code with object code. Using the -g
           option with the -o option allows for the maximum amount
           of optimization that is compatible with debugging.

-k         Keep the assembly output files. Notice that you now have
           the following .asm files in your current directory:
           demo1.asm, mac1.asm, vec_mpy1.asm, and iir1.asm.

           When the -k option is not used, the shell program deletes
           the assembly output files after assembly is complete.

-mg        Turn on the maximum amount of optimization that is
           compatible with profiling. The -mg option allows you to
           profile optimized code.

-mv6700    Compiler is invoked to target ’C67x devices.

           If this switch is not used, the compiler defaults to the ’C62x
           device. This code will run on a ’C67x device, but it will run
           slower if using floating-point instructions since the code will
           have been compiled for the ’C62x device.




-z              Invoke the linker. The addition of this option to the cl6x
                command line means that the code is compiled, assembled,
                and linked in one step.

lnk.cmd         Use lnk.cmd as the linker command file. Linker command
                files allow you to put linking information into a file,
                which is useful when you invoke the linker often with the
                same information.

                Linker command files are also useful because they allow
                you to use the MEMORY directive, which defines the target
                memory configuration, and the SECTIONS directive, which
                controls how sections are built and allocated.

-l rts6201.lib  Include the runtime-support library for the ’C62x device,
                rts6201.lib, which is included on your CD-ROM.

                The runtime-support functions in rts6201.lib were compiled
                for little-endian mode. For big-endian mode, use the
                runtime-support functions in rts6201e.lib.

-l rts6701.lib  Include the runtime-support library for the ’C67x device,
                rts6701.lib, which is included on your CD-ROM.

                The runtime-support functions in rts6701.lib were compiled
                for little-endian mode. For big-endian mode, use the
                runtime-support functions in rts6701e.lib.

-o demo1.out    Name the output file demo1.out. (The default is a.out.)

                Because this option comes after the -z option, it is
                considered a linker option and is interpreted differently
                than the -o option that you entered before -z.


Try This: The options above are used throughout the rest of this tutorial.
They are fairly common and might be ones that you want to use repeatedly.
To avoid having to retype them each time you run the code development tools,
you can use the C_OPTIONS environment variable. The shell program uses
the default options and/or input filenames that you name with the C_OPTIONS
environment variable every time you run the shell.


Use the commands in Table 2—1 to set up the C_OPTIONS environment vari- 
able with the options used on page 2-5. 
Table 2-1. Using the C_OPTIONS Environment Variable


Your Setup             What to Change   Command
Windows NT™            System applet    SET C_OPTION=-g -o -k -mg -z lnk.cmd -l rts6201.lib
Windows™ 95            autoexec.bat     SET C_OPTION=-g -o -k -mg -z lnk.cmd -l rts6201.lib
C shell                .cshrc           setenv C_OPTION "-g -o -k -mg -z lnk.cmd -l rts6201.lib"
Bourne or Korn shell   .profile         C_OPTION="-g -o -k -mg -z lnk.cmd -l rts6201.lib"; export C_OPTION


Notice that the -o demo1.out linker option was not included. If it were
included, running the second tutorial example, demo2.c, would result in an
output file named demo1.out instead of a more logical name such as demo2.out.


Name the source files explicitly on the command line rather than in the
environment variable. To compile all of the C files in a directory, use the
cl6x command with the appropriate options and use *.c where the files are
normally indicated. For example:


cl6x -g -mg *.c -z lnk.cmd -l rts6201.lib -o demo1.out


Important! If you are using SunOS, be sure you reinitialize your shell before
continuing with this tutorial:


For C shells, enter the following on a command line:


source ~/.cshrc


For Bourne or Korn shells, enter the following on a command line:


. $HOME/.profile
2.4 Lesson 2: Profiling the Example Code


Now, use the profiler to look at the output of demo1. In this lesson, you will use 
the profiler to see the total execution time in number of cycles of each C func- 
tion in demo1.out. 


To start the profiler and load demo1.out, follow these steps:


1) Double-click the icon for the debugger.


2) From the Profile menu, select Profile Mode.

   The debugger switches to profiling mode and displays only the
   Command, Disassembly, File, and Profile windows.


3) From the File menu, select Load Program.

   This displays the Load Program File dialog box.


4) Double-click the demo1.out file. To do so, you might need to change the
   working directory.

   This loads demo1.out into the profiler. Because the File window is
   reserved for C programs, it disappears.
To select the areas of demo1 that you want profiled, follow these steps:


1) From the Profile menu, select Select Areas.

   This displays the Profile Marking dialog box.


2) In the Level box, select C.


3) In the Area box, select Functions.

   This indicates that the C functions in demo1.out will be your profile
   areas.


4) Click Mark.


[Screenshot: Profile Marking dialog box. The Level box offers C, Assembly,
and Both; the Area box offers Lines, Ranges, Functions, and All areas. C and
Functions are selected.]


5) Click Close. 


The Profile window is updated to include a line for each C function in 
demo1. 
To start the profiling session, follow these steps:


1) Click the run icon on the toolbar: 


This displays the Profile Run dialog box. 


2) In the Run Method box, select Quick, no exclusive fields. This will show
you the total execution time (cycle count) of a profile area, including the
execution time of any subroutines called within the functions.


3) If main() is not already selected as your starting point, choose it from 
the list of starting points. 


[Screenshot: Profile Run dialog box, showing the Run Method box, the Display
Rate slider (Often to Never), and the Start Point list set to main.]


4) Click OK. 


The Run Method dialog box closes and the status bar reads Target: 
Profiling to indicate that the profiling session has started. 
The program restarts and runs to main( ) without profiling. Profiling begins
when main( ) is reached and continues until the exit point of main( ) is
reached. When profiling is complete, the status bar reads Target: Halted and
your Profile window looks like this:


[Screenshot: Profile window listing the C functions iir1( ), mac1( ),
main( ), and vec_mpy1( ) under the columns Area Name, Count, Inclusive, and
Incl-Max.]


The Inclusive column indicates the cycle counts for each function, including 
any function that it calls. Because these functions do not call any other func- 
tions, the inclusive cycle counts are the same as the exclusive cycle counts. 
Notice that the cycle count for the mac1() function is 167, and that the cycle 
counts for the vec_mpy1() and iir1() functions are much higher—316 and 
270, respectively. 


To interpret the cycle counts in the Profile window, you need to understand how 
they are calculated. Here is the formula for calculating cycle counts: 


Execute packets x loop iterations in C code + constant 


An execute packet is a group of parallel instructions. You can have up to eight 
instructions executing in parallel; therefore, each execute packet can contain 
up to eight instructions. An example of execute packets is shown in 
Example 2-7 on page 2-15. 


Table 2—2 shows how the cycle counts were calculated for each function. 


Table 2-2. Cycle Counts


Function      Execute Packets   Loop Iterations   Constant   Cycle Count
mac1( )       1                 150               17         1 x 150 + 17 = 167
vec_mpy1( )   2                 150               16         2 x 150 + 16 = 316
iir1( )       5                 50                20         5 x 50 + 20 = 270
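
The formula is simple enough to evaluate by hand, but a small helper makes
it easy to try other kernel sizes. The following is a minimal sketch; the
function name cycle_count( ) is hypothetical and is not part of the ’C6x
tools.

```c
#include <assert.h>

/* Cycle estimate from the formula above:
 * execute packets in the loop kernel x loop iterations + constant. */
int cycle_count(int execute_packets, int loop_iterations, int constant)
{
    assert(execute_packets > 0 && loop_iterations >= 0 && constant >= 0);
    return execute_packets * loop_iterations + constant;
}
```

Calling cycle_count(1, 150, 17), cycle_count(2, 150, 16), and
cycle_count(5, 50, 20) reproduces the 167, 316, and 270 cycles listed in
Table 2-2.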
Try This: The stand-alone simulator (load6x) is another tool that you can
use to find out what the cycle count for each function is. To get cycle
count information for each function with the stand-alone simulator, embed
the clock( ) function in your C code. Example 2-5 shows how to rewrite
demo1.c to include the clock( ) function.


Example 2-5. Including the clock( ) Function in demo1.c (count.c) 


#include <stdio.h>
#include <time.h>

main(int argc, char *argv[])
{
    const short coefs[150];
    short optr[150];
    short state[2];
    const short a[150];
    const short b[150];
    int c = 0;
    int dotp[1] = {0};
    int sum = 0;
    short y[150];
    short scalar = 3345;
    const short x[150];
    clock_t start, stop, overhead;

    /* Measure the cost of the clock() calls themselves */
    start = clock();
    stop = clock();
    overhead = stop - start;

    start = clock();
    sum = mac1(a, b, c, dotp);
    stop = clock();
    printf("mac1 cycles: %d\n", (int)(stop - start - overhead));

    start = clock();
    vec_mpy1(y, x, scalar);
    stop = clock();
    printf("vec_mpy1 cycles: %d\n", (int)(stop - start - overhead));

    start = clock();
    iir1(coefs, x, optr, state);
    stop = clock();
    printf("iir1 cycles: %d\n", (int)(stop - start - overhead));
}


Note: When using this method, remember to calculate the overhead and include
the appropriate header files.
Now, compile, assemble, and link count.c.


If you did not set up your C_OPTIONS environment variable as described
on page 2-6, enter the following on a command line:


cl6x -g -o -k -mg count.c mac1.c vec_mpy1.c iir1.c
     -z lnk.cmd -l rts6201.lib -o count.out


OR


If you set up your C_OPTIONS environment variable as described on
page 2-6, enter the following on a command line:


cl6x -z -o count.out


Although the -z option is already specified in the C_OPTIONS environment
variable, you need to specify it on the command line to indicate that this
occurrence of -o is a linker option.


Use load6x to see the output of the printf statements that were embedded in 
the C code. 


On a command line, enter: 


load6x count.out 


You should see the following output: 


TMS320C6x C I/O COFF Loader Version 1.01 
Copyright (c) 1989-1997 Texas Instruments Incorporated 
Interrupt to abort 

mac1 cycles: 175

vec_mpy1 cycles: 324

iir1 cycles: 278

NORMAL COMPLETION: 20949 cycles 


Notice that these cycle counts are higher than the cycle counts that you saw
with the profiler. For example, mac1 is listed here as having 175 cycles;
however, it was listed in the Profile window as having 167 cycles. Each
count is 8 cycles higher because load6x still includes the overhead of each
function call. When you use the profiler, the cycles needed for calling the
functions are not included in the profile display.


The Using the Stand-Alone Simulator chapter in the TMS320C6x Optimizing 
C Compiler User’s Guide discusses load6x in more detail. 
2.5 Lesson 3: Phase 1 of the Code Development Flow


Looking at the functions in demo1 one at a time, you can determine whether 
or not they need to be improved and, if they do need to be improved, how they 
can be improved. Start by looking at the first function, mac1( ). 


Example 2—6 shows the assembly output of the function’s inner loop kernel. 
The loop kernel is the area of the loop with the most parallelism. Only the inner 
loop is shown, because this is the area that can be improved with software pi- 
pelining. Notice that there are eight instructions executing in parallel (as indi- 
cated by the seven sets of parallel bars). This is the maximum number of 
instructions that the ’C6x can execute in parallel, so this code does not need 
to be improved. 


Example 2-6. Inner Loop Kernel of mac1.asm 


L3:        ; PIPED LOOP KERNEL
           ADD   .L2   B4,B7,B7      ;
||         ADD   .L1   A5,A3,A3      ;
||         MPY   .M2X  A4,B5,B4      ;@@
||         MPY   .M1   A4,A4,A5      ;@@
|| [ B0]   B     .S1   L3            ;@@@@@
|| [ B0]   SUB   .S2   B0,1,B0       ;@@@@@@
||         LDH   .D1   *A0++,A4      ;@@@@@@@
||         LDH   .D2   *B6++,B5      ;@@@@@@@


The @ characters specify the iteration of the loop that an instruction is on
in the software pipeline; these symbols are automatically created by the code
generation tools. The first iteration does not have an @ character; one @
character represents the second iteration; two @ characters represent the
third iteration, and so on.


Because the mac1( ) function does not need to be improved, it does not need 
to go beyond phase 1 of the code development flow. 
Look at Example 2-7, which shows the assembly output of the innermost loop
for the vec_mpy1( ) function. Recall from page 2-11 that the vec_mpy1( )
function took 316 cycles to execute. This code is not as parallel as the
mac1( ) function. The assembly output for the vec_mpy1( ) function shows two
execute packets. Each execute packet has four parallel instructions. This
loop can be improved.


Example 2-7. Inner Loop Kernel of vec_mpy1.asm 


L3:        ; PIPED LOOP KERNEL

           ADD   .L2X  A3,B6,B5      ;
|| [ A1]   B     .S1   L3            ;@@
||         LDH   .D2   *+B4(6),B6    ;@@@
||         LDH   .D1   *A0++,A4      ;@@@@

           STH   .D2   B5,*B4++      ;
||         SHR   .S1   A3,15,A3      ;@
||         MPY   .M1   A5,A4,A3      ;@@
|| [ A1]   SUB   .L1   A1,1,A1       ;@@@
Example 2-8 shows the assembly output of the innermost loop for the iir1( )
function. Recall from page 2-11 that the iir1( ) function took 270 cycles to
execute. As you can see, some execute packets have five parallel instructions,
while others have as few as four parallel instructions, which indicates that
the code can probably be improved.


Example 2-8. Inner Loop Kernel of iir1.asm 


L3:        ; PIPED LOOP KERNEL
           SHR   .S2   B4,15,B4        ;
           SHR   .S1   A3,15,A5        ;
           MPY   .M2X  B6,A5,B6        ;@
           LDH   .D1   *+A6(16),A4     ;@@
           LDH   .D2   *+B7(10),B6     ;@@
           ADD   .L1   A0,A5,A0        ;
           MPY   .M1X  B6,A3,A3        ;@
           MPY   .M2X  B5,A4,B5        ;@
           LDH   .D1   *+A6(22),A3     ;@@
           LDH   .D2   *+B7(8),B5      ;@@
           EXT   .S1   A0,16,16,A0     ;
           STH   .D2   B5,*+B7(6)      ;@
           MPY   .M1X  B5,A3,A4        ;@
           LDH   .D1   *+A6(20),A3     ;@@
           ADD   .S1   8,A6,A6         ;
           STH   .D2   A0,*B7++(4)     ;
           ADD   .L1X  A0,B4,A0        ;
   [ B0]   SUB   .L2   B0,1,B0         ;@
           ADD   .S2   B6,B5,B4        ;@
           EXT   .S1   A0,16,16,A0     ;
   [ B0]   B     .S2   L3              ;@
           ADD   .L1   A3,A4,A3        ;@
           LDH   .D1   *+A6(18),A5     ;@@@


To improve the vec_mpy1( ) and iir1( ) functions, start by seeing how you can
refine and improve your C code. This is what is referred to as phase 2 of the
code development flow, and this is what the next lesson is about.
For your convenience, the vec_mpy1( ) function is duplicated here as
Example 2-9 (the C version) and Example 2-10 (the assembly output of the
inner loop). This is the same code that you saw in Example 2-3 and
Example 2-7.


Example 2-9. The Vector Multiply Function—vec_mpy1.c 


void vec_mpy1(short y[], const short x[], short scalar)
{
    int i;

    for (i = 0; i < 150; i++)
        y[i] += ((scalar * x[i]) >> 15);
}


Example 2-9 uses short data types. Because short data types are 16 bits, they 
translate into halfword instructions, such as LDH and STH (see 
Example 2-10). 


The loop in Example 2—10 uses two LDH instructions and an STH instruction 
to load x[i] and y[i] and store back to y[i]. Because only two memory operations 
can occur per cycle, the fastest that this loop can execute is one y[i] result ev- 
ery two cycles. The performance of this loop is limited by the number of D units. 


Example 2-10. Inner Loop Kernel of vec_mpy1.asm 


L3:        ; PIPED LOOP KERNEL
           ADD   .L2X  A3,B6,B5      ;
|| [ A1]   B     .S1   L3            ;@@
||         LDH   .D2   *+B4(6),B6    ;@@@
||         LDH   .D1   *A0++,A4      ;@@@@

           STH   .D2   B5,*B4++      ;
||         SHR   .S1   A3,15,A3      ;@
||         MPY   .M1   A5,A4,A3      ;@@
|| [ A1]   SUB   .L1   A1,1,A1       ;@@@
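
The two-cycle limit follows from dividing the memory operations in one
iteration by the number of units that can perform them and rounding up. The
sketch below expresses that bound; the function name
min_cycles_per_iteration( ) is hypothetical and is used here only for
illustration.

```c
#include <assert.h>

/* Lower bound on cycles per loop iteration imposed by one kind of
 * functional unit: ceil(operations per iteration / units available). */
int min_cycles_per_iteration(int ops_per_iteration, int units)
{
    assert(ops_per_iteration >= 0 && units > 0);
    return (ops_per_iteration + units - 1) / units;
}
```

For this loop, min_cycles_per_iteration(3, 2) gives 2: two loads and one
store per iteration shared between two D units. Two cycles for each of the
150 iterations plus the constant of 16 gives the 316 cycles reported by the
profiler.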


Because x is an array, x[i] and x[i + 1] are next to each other in memory. This
means that instead of using halfword instructions (LDH and STH) to load and
store each element in the array, you can use word instructions (LDW and STW)
to load and store two elements at a time, as long as the data is aligned on a
word boundary. In other words, all word accesses should have the 2 LSBs of
the address set to 0. Two elements at a time, x[i] and x[i + 1], fit into one 32-bit
register.
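
The packing described above can be sketched in portable C. The helper names (is_word_aligned, pack_pair, low_half, high_half) are hypothetical and are not part of the ’C6x tools. The sketch assumes a little-endian layout, in which x[i] occupies the 16 LSBs of the word and x[i + 1] the 16 MSBs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when the address is suitable for a
   word (32-bit) access, that is, when its 2 LSBs are 0. */
static int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 0x3u) == 0;
}

/* Hypothetical helper: packs two 16-bit elements into one 32-bit word.
   Little-endian assumption: lo is x[i], hi is x[i + 1]. */
static uint32_t pack_pair(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Recovers the element held in the 16 LSBs of the word. */
static int16_t low_half(uint32_t word)
{
    return (int16_t)(uint16_t)(word & 0xffffu);
}

/* Recovers the element held in the 16 MSBs of the word. */
static int16_t high_half(uint32_t word)
{
    return (int16_t)(uint16_t)(word >> 16);
}
```

For example, pack_pair(-2, 3) yields 0x0003fffe, and low_half( ) and high_half( ) return -2 and 3 from that word.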
To achieve this in C, declare x[ ] as an array of int instead of as an array
of short. You also need to use some intrinsics.


Now that you have determined that you can load x[i] and x[i + 1] into the same
register, you need to figure out how to operate on each half. You can do this
by using the _mpy and _mpylh intrinsics. Intrinsics are like built-in C
functions that correspond to ’C6x assembly language instructions. The _mpy
intrinsic multiplies the 16 LSBs of one operand by the 16 LSBs of another and
returns the result. The _mpylh intrinsic multiplies the 16 LSBs of the first
operand by the 16 MSBs of the second and returns the result.


You can then use the _add2 intrinsic to add the 16 MSBs of the first operand 
to the 16 MSBs of the second operand. At the same time, the _add2 intrinsic 
also adds the 16 LSBs of the first operand to the 16 LSBs of the second oper- 
and. The result of both additions is stored in a 32-bit operand. 


      MSBs | LSBs      (first operand)
   +  MSBs | LSBs      (second operand)
   ----------------
      MSBs | LSBs      (32-bit result)
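
To make the behavior of these three intrinsics concrete, here is a minimal sketch that models them in portable C. The functions emu_mpy, emu_mpylh, and emu_add2 (and the helpers lsb16 and msb16) are hypothetical names used for illustration only; they follow the descriptions above and are not the compiler's own implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Signed value held in the 16 LSBs of a word. */
static int32_t lsb16(uint32_t w) { return (int16_t)(uint16_t)(w & 0xffffu); }

/* Signed value held in the 16 MSBs of a word. */
static int32_t msb16(uint32_t w) { return (int16_t)(uint16_t)(w >> 16); }

/* Models _mpy: 16 LSBs of a times 16 LSBs of b. */
static int32_t emu_mpy(uint32_t a, uint32_t b)
{
    return lsb16(a) * lsb16(b);
}

/* Models _mpylh: 16 LSBs of a times 16 MSBs of b. */
static int32_t emu_mpylh(uint32_t a, uint32_t b)
{
    return lsb16(a) * msb16(b);
}

/* Models _add2: two independent 16-bit additions. A carry out of the
   lower half does not propagate into the upper half. */
static uint32_t emu_add2(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000ffffu;
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;
    return hi | lo;
}
```

With these models, emu_add2(0x0001ffff, 0x00010001) gives 0x00020000: the lower halves wrap to 0 without disturbing the sum of the upper halves.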


Example 2-11 shows how to rewrite the vec_mpy() function to include the 
_mpy and _mpylh intrinsics: 


Example 2-11. The Revised Vector Multiply Function—vec_mpy2.c 


void vec_mpy2(int y[], const int x[], short scalar)
{
    int i, val;
    unsigned int temph, templ;

    for (i = 0; i < 75; i++)
    {
        val = x[i];
        templ = (_mpy(scalar, val) >> 15) & 0x0000ffff;
        temph = (_mpylh(scalar, val) << 1) & 0xffff0000;
        y[i] = _add2(y[i], temph | templ);
    }
}
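
One way to gain confidence in a packed rewrite like this is to compare it against the scalar loop on a host machine. The sketch below is an assumption-laden illustration, not part of the tutorial files: vec_mpy_ref and vec_mpy_packed are hypothetical names, the intrinsics are replaced by plain C, the words are assumed to hold x[2i] in the 16 LSBs and x[2i + 1] in the 16 MSBs, and right-shifting a negative int is assumed to be an arithmetic shift (as on mainstream compilers).

```c
#include <assert.h>
#include <stdint.h>

/* Reference: the scalar loop of vec_mpy1(), for n 16-bit elements. */
static void vec_mpy_ref(int16_t y[], const int16_t x[], int16_t scalar, int n)
{
    int i;

    for (i = 0; i < n; i++)
        y[i] = (int16_t)(y[i] + ((scalar * x[i]) >> 15));
}

/* Packed version: the loop body of vec_mpy2() with the intrinsics
   written out in plain C. Each word carries two 16-bit elements. */
static void vec_mpy_packed(uint32_t y[], const uint32_t x[], int16_t scalar,
                           int nwords)
{
    int i;

    assert(nwords >= 0);
    for (i = 0; i < nwords; i++) {
        /* stands in for _mpy: scalar times the 16 LSBs of x[i] */
        int32_t plo = scalar * (int16_t)(uint16_t)(x[i] & 0xffffu);
        /* stands in for _mpylh: scalar times the 16 MSBs of x[i] */
        int32_t phi = scalar * (int16_t)(uint16_t)(x[i] >> 16);
        uint32_t templ = (uint32_t)(plo >> 15) & 0x0000ffffu;
        uint32_t temph = ((uint32_t)phi << 1) & 0xffff0000u;
        uint32_t t = temph | templ;

        /* stands in for _add2: two independent 16-bit additions */
        y[i] = (((y[i] >> 16) + (t >> 16)) << 16) | ((y[i] + t) & 0xffffu);
    }
}
```

Running both functions on the same data (packed two elements per word for the second) should leave identical 16-bit results in y.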
Now, look at the iir1( ) function. Example 2-12 shows the same code that you
saw in Example 2-4. 


Example 2-12. The Biquad Filter—iir1.c 


void iir1(const short *coefs, const short *input,
          short *optr, short *state)
{
    short x;
    short t;
    int n;

    x = input[0];

    for (n = 0; n < 50; n++)
    {
        t = x + ((coefs[2] * state[0] +
                  coefs[3] * state[1]) >> 15);

        x = t + ((coefs[0] * state[0] +
                  coefs[1] * state[1]) >> 15);

        state[1] = state[0];
        state[0] = t;

        coefs += 4;    /* point to next filter coefs  */
        state += 2;    /* point to next filter states */
    }

    *optr++ = x;
}
You can improve the iir( ) function by using the same methods that you used
to improve the vec_mpy( ) function. Example 2-13 shows how to rewrite the
iir( ) function:


Example 2-13. The Revised Biquad Filter—iir2.c 


void iir2(const int *coefs, const short *input,
          short *optr, short *state)
{
    short x;
    short t;
    int n;

    x = input[0];
    for (n = 0; n < 50; n++)
    {
        t = x + ((_mpy(coefs[1], state[0]) +
                  _mpyhl(coefs[1], state[1])) >> 15);
        x = t + ((_mpy(coefs[0], state[0]) +
                  _mpyhl(coefs[0], state[1])) >> 15);
        state[1] = state[0];
        state[0] = t;
        coefs += 2;
        state += 2;
    }

    *optr++ = x;
}
Using demo2.c, shown in Example 2-14, run the revised functions through the
compiler, assembler, and linker. 


Example 2-14. The Revised Example—demo2.c 


main(int argc, char *argv[])
{
    const short coefs[100];
    short optr[100];
    short state[2];
    const short a[100];
    const short b[100];
    int c = 0;
    int dotp[1] = {0};
    int sum = 0;
    short y[100];
    short scalar = 3345;
    const short x[100];

    sum = mac1(a, b, c, dotp);
    vec_mpy2(y, x, scalar);
    iir2(coefs, x, optr, state);
}


If you did not set up your C_OPTIONS environment variable as described
on page 2-6, enter the following on a command line:

cl6x -g -o -k -mg demo2.c mac1.c vec_mpy2.c iir2.c
-z lnk.cmd -l rts6201.lib -o demo2.out

OR

If you set up your C_OPTIONS environment variable as described on
page 2-6, enter the following on a command line:

cl6x -z -o demo2.out

Although the -z option is already specified in the C_OPTIONS environment
variable, you need to specify it on the command line to indicate that this
occurrence of -o is a linker option.
The inner loop of the vec_mpy2( ) function translates into the assembly output
shown in Example 2-15. 


Example 2-15. Inner Loop Kernel of vec_mpy2.asm 


L3:        ; PIPED LOOP KERNEL

           OR     .L2X   B5,A8,B7         ;@
           SHL    .S1    A6,1,A4          ;@@
   [ A1]   B      .S2    L3               ;@@
           AND    .L1    A5,A4,A6         ;@@
           LDW    .D2    *+B4(12),B5      ;@@@
           MPYLH  .M1    A0,A9,A6         ;@@@@
           LDW    .D1    *A3++,A9         ;@@@@

           STW    .D2    B6,*B4++         ;
           ADD2   .S2    B5,B7,B6         ;@
           AND    .L1    A7,A4,A8         ;@@
           MV     .L2X   A6,B5            ;@@
   [ A1]   SUB    .D1    A1,1,A1          ;@@@
           SHR    .S1    A8,15,A4         ;@@@
           MPY    .M1    A0,A9,A8         ;@@@@


As you can see, the code for the vec_mpy2( ) function is improved over the
original vec_mpy( ) code. Two LDW instructions load four elements
(x[i], x[i+1], y[i], and y[i+1]), and one STW instruction stores two elements:
y[i] and y[i+1]. With the revised code, two y[i] results are stored every two
cycles. Recall that only one y[i] result was stored every two cycles in
Example 2-10.


Table 2-3 shows how the vec_mpy( ) function has improved as it moves from 
phase 1 to phase 2. 


Table 2-3. Revised Cycle Counts for vec_mpy( ) 


Function       Execute Packets   Loop Iterations   Constant   Cycle Count
vec_mpy1( )    2                 150               16         2 x 150 + 16 = 316
vec_mpy2( )    2                 75                22         2 x 75 + 22 = 172
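
The cycle counts in these tables all follow the same arithmetic: execute packets in the loop kernel times loop iterations, plus a constant overhead. As a small sketch (the function name loop_cycle_count is hypothetical), the calculation can be written as:

```c
#include <assert.h>

/* Estimated cycles for a software-pipelined loop, following the tables
   in this lesson:
   execute packets in the kernel x loop iterations + constant overhead. */
static int loop_cycle_count(int execute_packets, int iterations, int constant)
{
    assert(execute_packets > 0 && iterations >= 0 && constant >= 0);
    return execute_packets * iterations + constant;
}
```

Calling loop_cycle_count(2, 150, 16) reproduces the 316 cycles listed for vec_mpy1( ), and loop_cycle_count(2, 75, 22) reproduces the 172 cycles listed for vec_mpy2( ).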
Now, look at the inner loop of the third function, iir( ). Example 2-16 shows the
assembly output of the innermost loop for the revised iir( ) function:


Example 2-16. Inner Loop Kernel of iir2.asm 


L3:        ; PIPED LOOP KERNEL
           ADD    .L2    B7,B8,B7         ;
           ADD    .L1    A0,A3,A0         ;
           MV     .S2    B6,B9            ;@
           STH    .D1    A5,*+A4(6)       ;@
           LDW    .D2    *B5++(8),B8      ;@@
           SHR    .S2    B7,15,B7         ;
           EXT    .S1    A0,16,16,A0      ;
   [ B0]   SUB    .L2    B0,1,B0          ;@
           MPY    .M2X   B8,A5,B8         ;@
           ADD    .L1X   B6,A3,A3         ;@
           LDH    .D2    *+B4(14),B6      ;@@@
           ADD    .L1X   A0,B7,A6         ;
           MPYHL  .M2    B8,B9,B7         ;@
           SHR    .S1    A3,15,A3         ;@
   [ B0]   B      .S2    L3               ;@
           LDW    .D2    *+B5(4),B7       ;@@@
           LDH    .D1    *+A4(12),A5      ;@@@
           ADD    .L2    4,B4,B4          ;
           STH    .D1    A0,*A4++(4)      ;
           EXT    .S1    A6,16,16,A0      ;
           MPYHL  .M2    B7,B6,B6         ;@@
           MPY    .M1X   B7,A5,A3         ;@@


Table 2-4 shows how the iir( ) function has improved. The kernel now has only
four execute packets; however, each packet contains only five or six parallel
instructions out of a possible eight, which leaves room for improvement.


Table 2-4. Revised Cycle Counts for iir( ) 


Function    Execute Packets   Loop Iterations   Constant   Cycle Count
iir1( )     5                 50                20         5 x 50 + 20 = 270
iir2( )     4                 50                20         4 x 50 + 20 = 220
Use the profiler to view the cycle counts of the revised functions.


Your profile window should look like this:

Area Name                 Count   Inclusive   Incl-Max
C Function vec_mpy2()     1       172         172
C Function iir2()         1       220         220
C Function mac1()         1       167         167
C Function main()         1       637         637


Notice that the cycle count for the second function, the vector multiply, is down 
from 316 to 172. The IIR filter has improved also: it is down from 270 to 220. 
However, the cycle count for the IIR filter is still too high. Naturally, the cycle 
count for main( ) has decreased also. It is down from 831 to 637. 


Table 2-5. Revised Cycle Counts 


Function       Execute Packets   Loop Iterations   Constant   Cycle Count
mac1( )†       1                 150               17         1 x 150 + 17 = 167
vec_mpy2( )    2                 75                22         2 x 75 + 22 = 172
iir2( )        4                 50                20         4 x 50 + 20 = 220

† The cycle count for the mac1( ) function has not changed.


You have done everything you can to refine the C code in the iir( ) function. To
improve your code at this point, you need to use the assembly optimizer. This
leads you to phase 3 of the code development flow.
2.7 Lesson 5: Phase 3 of the Code Development Flow


To further improve the iir( ) function, you will need to rewrite it in linear assem- 
bly. Linear assembly is the input for the assembly optimizer. 


Linear assembly is similar to regular ’C6x assembly code in that you use ’C6x 
instructions to write your code. With linear assembly, however, you do not need 
to specify all of the information that you need to specify in regular ’C6x assem- 
bly code. With linear assembly code, you have the option of specifying the in- 
formation or letting the assembly optimizer specify it for you. Here is the in- 
formation that you do not need to specify in linear assembly code: 


- Parallel instructions
- Pipeline latency
- Register usage
- Which functional unit is being used


If you choose not to specify these things, the assembly optimizer determines 
the information that you do not include, based on the information that it has 
about your code. As with other code generation tools, you might need to modify 
your linear assembly code until you are satisfied with its performance. When 
you do this, you will probably want to add more detail to your linear assembly. 
For example, you might want to specify which functional unit should be used. 


Before you use the assembly optimizer, you need to know the following things 
about how it works: 


- A linear assembly file must be specified with a .sa extension.

- Linear assembly code should include the .cproc and .endproc directives.
  The .cproc and .endproc directives delimit a section of your code that you
  want the assembly optimizer to optimize. Use .cproc at the beginning of
  the section and .endproc at the end of the section. In this way, you can set
  off sections of your assembly code that you want to be optimized, like
  procedures or functions.

- Linear assembly code may include a .reg directive. The .reg directive
  allows you to use descriptive names for values that will be stored in
  registers. When you use .reg, the assembly optimizer chooses a register whose
  use agrees with the functional units chosen for the instructions that
  operate on the value.

- Linear assembly code may include a .trip directive. The .trip directive
  specifies the value of the trip count. The trip count indicates how many
  times a loop will iterate.


Now that you have some information about the fundamentals of linear assem- 
bly code, look at the revised C code for the biquad filter again. Example 2-17 
shows the same code that you saw in Example 2-13 on page 2-20. 
Example 2-17. The Revised Biquad Filter—iir2.c


void iir2(const int *coefs, const short *input,
          short *optr, short *state)
{
    short x;
    short t;
    int n;

    x = input[0];
    for (n = 0; n < 50; n++)
    {
        t = x + ((_mpy(coefs[1], state[0]) +
                  _mpyhl(coefs[1], state[1])) >> 15);

        x = t + ((_mpy(coefs[0], state[0]) +
                  _mpyhl(coefs[0], state[1])) >> 15);

        state[1] = state[0];
        state[0] = t;

        coefs += 2;
        state += 2;
    }

    *optr++ = x;
}


Example 2—18 shows how to rewrite the iir( ) function in linear assembly. 
Example 2-18. The Biquad Filter, Revised and Assembly-Optimized—iir3.sa


        .def    _iir3

_iir3:  .cproc  cptr0, sptr0

        .reg    cptr1, s01, s10, s23, c10, c32, s10_s, s10_t
        .reg    p0, p1, p2, p3, s23_s, s1, t, x, mask, sptr1, s10p, ctr

        MV      .2      cptr0,cptr1
        MV      .1      sptr0,sptr1

        MVK             50,ctr              ; setup loop counter

LOOP:   .trip   50

        LDW     .D1T1   *cptr0++[2],c32     ; CoefAddr[3] & CoefAddr[2]
        LDW     .D2T2   *cptr1++[2],c10     ; CoefAddr[1] & CoefAddr[0]
        LDW     .D1T2   *sptr0,s10          ; StateAddr[1] & StateAddr[0]
        MV      .2      s10,s10p            ; save StateAddr[1] & StateAddr[0]

        MPY     .M1     c32,s10,p2          ; CoefAddr[2] * StateAddr[0]
        MPYH    .M1     c32,s10,p3          ; CoefAddr[3] * StateAddr[1]
        ADD             p2,p3,s23           ; CA[2] * SA[0] + CA[3] * SA[1]
        SHR             s23,15,s23_s        ; (CA[2] * SA[0] + CA[3] * SA[1]) >> 15
        ADD             s23_s,x,t           ; t = x+((CA[2]*SA[0]+CA[3]*SA[1])>>15)
        AND             t,mask,t            ; clear upper 16 bits

        MPY     .M2     c10,s10,p0          ; CoefAddr[0] * StateAddr[0]
        MPYH    .M2     c10,s10,p1          ; CoefAddr[1] * StateAddr[1]
        ADD     .2      p0,p1,s10_t         ; CA[0] * SA[0] + CA[1] * SA[1]
        SHR     .2      s10_t,15,s10_s      ; (CA[0] * SA[0] + CA[1] * SA[1]) >> 15
        ADD     .2      s10_s,t,x           ; x = t+((CA[0]*SA[0]+CA[1]*SA[1])>>15)

        SHL     .2      s10p,16,s1          ; StateAddr[1] = StateAddr[0]
        OR      .2      t,s1,s01            ; StateAddr[0] = t
        STW     .D1     s01,*sptr1++        ; store StateAddr[1] & StateAddr[0]

[ctr]   ADD     .1      -1,ctr,ctr          ; dec outer lp cntr
[ctr]   B       .S1     LOOP                ; Branch outer loop

        .endproc
Using demo3.c, shown in Example 2-19, run the revised functions through the
code generation tools. 


Example 2-19. The Revised Example—demo3.c 


main(int argc, char *argv[])
{
    const short coefs[150];
    short optr[150];
    short state[2];
    const short a[150];
    const short b[150];
    int c = 0;
    int dotp[1] = {0};
    int sum = 0;
    short y[150];
    short scalar = 3345;
    const short x[150];

    sum = mac1(a, b, c, dotp);
    vec_mpy2(y, x, scalar);
    iir3(coefs, x, optr, state);
}


Use the shell program (cl6x) to compile, assemble, and link. Be sure you use
the -mg option. The -mg option ensures that the optimizations that are used
are compatible with profiling.


On a command line, enter:


cl6x -g -o -k -mg demo3.c mac1.c vec_mpy2.c iir3.sa
-z lnk.cmd -l rts6201.lib -o demo3.out


Notice that you used the shell program to compile a linear assembly file and
a C file at the same time. Also notice that (except for the -mg option) you used
the same options that you used in the first part of this tutorial. The assembly
optimizer has a small set of unique options, but many of the options that
you will use are shell options that apply to either linear assembly files or C files.
Example 2-20. Inner Loop Kernel of iir3.asm


L3:        ; PIPED LOOP KERNEL
           AND    .L2    B3,B7,B0           ;     clear upper 16 bits
           ADD    .S2    B0,B8,B8           ;@    CA[0] * SA[0] + CA[1] * SA[1]
   [ A1]   B      .S1    L3                 ;@    Branch outer loop
           ADD    .L1    A4,A5,A4           ;@    CA[2] * SA[0] + CA[3] * SA[1]
           MPYH   .M2    B2,B1,B8           ;@@   CoefAddr[1] * StateAddr[1]
           MPY    .M1X   A0,B1,A4           ;@@   CoefAddr[2] * StateAddr[0]
           LDW    .D2    *B6++(8),B2        ;@@@@ CoefAddr[1] & CoefAddr[0]
           LDW    .D1    *A3++(8),A0        ;@@@@ CoefAddr[3] & CoefAddr[2]
           ADD    .D2    B4,B0,B9           ;     x = t+((CA[0]*SA[0]+CA[1]*SA[1])>>15)
           OR     .L2    B0,B9,B0           ;     StateAddr[0] = t
           SHR    .S2    B8,0xf,B4          ;@    (CA[0] * SA[0] + CA[1] * SA[1]) >> 15
           SHR    .S1    A4,0xf,A5          ;@    (CA[2] * SA[0] + CA[3] * SA[1]) >> 15
           MPY    .M2    B2,B1,B0           ;@@   CoefAddr[0] * StateAddr[0]
           MPYH   .M1X   A0,B1,A5           ;@@   CoefAddr[3] * StateAddr[1]
           LDW    .D1    *A6++,B1           ;@@@@ StateAddr[1] & StateAddr[0]
           STW    .D1    B0,*A7++           ;     store StateAddr[1] & StateAddr[0]
           SHL    .S2    B5,0x10,B9         ;@    StateAddr[1] = StateAddr[0]
           ADD    .L2X   B9,A5,B3           ;@    t = x+((CA[2]*SA[0]+CA[3]*SA[1])>>15)
   [ A1]   ADD    .S1    0xffffffff,A1,A1   ;@@   dec outer lp cntr
           MV     .D2    B1,B5              ;@@   save StateAddr[1] & StateAddr[0]


Table 2-6 shows how the iir( ) function has improved as it has moved through 
the three phases of code development. 


Table 2-6. Revised Cycle Counts for iir( ) 


Function    Execute Packets   Loop Iterations   Constant   Cycle Count
iir1( )     5                 50                20         5 x 50 + 20 = 270
iir2( )     4                 50                20         4 x 50 + 20 = 220
iir3( )     3                 50                27         3 x 50 + 27 = 177
Use the profiler to view the cycle counts of the revised functions.


Your profile window should look like this:

Area Name                 Count   Inclusive   Incl-Max
C Function vec_mpy2()     1       172         172
C Function iir3()         1       177         177
C Function mac1()         1       167         167
C Function main()         1       594         594


Notice that the cycle count for the IIR filter has improved: it is down from 220
to 177. Naturally, the cycle count for main( ) has decreased also. It is down
from 637 to 594.


Table 2-7. Revised Cycle Counts 


Function        Execute Packets   Loop Iterations   Constant   Cycle Count
mac1( )†        1                 150               17         1 x 150 + 17 = 167
vec_mpy2( )†    2                 75                22         2 x 75 + 22 = 172
iir3( )         3                 50                27         3 x 50 + 27 = 177

† The cycle counts for the mac1( ) function and the vec_mpy( ) function have not changed.


The Using the Assembly Optimizer chapter in the TMS320C6x Optimizing C 
Compiler User’s Guide discusses the assembly optimizer in more detail. 
2.8 Summary


Congratulations! In this tutorial, you learned the following things: 


- What the three phases of code development are, how to determine which
  phases are appropriate for improving different parts of your code, and how
  to write your code for each phase.

- What a linear assembly file is and some fundamental information on how
  to write one.

- How to use the code generation tools to compile, assemble, and link your
  C and linear assembly files.

- How to use the profiler to analyze your results and determine whether or
  not you need to continue refining your code.
Chapter 3


TMS320C6x Optimization Checklist


Because most of the millions of instructions per second (MIPS) in DSP
applications occur in tight loops, it is important for the ’C6x code generation
tools to make maximal use of all the hardware resources in important loops.
Fortunately, loops inherently have more parallelism than non-looping code
because there are multiple iterations of the same code executing with limited
dependencies between each iteration. Through a technique called software
pipelining, the ’C6x code generation tools use the multiple resources of the
VelociTI architecture efficiently and obtain very high performance.


This chapter shows the code development flow recommended to achieve the 
highest performance on loops and provides a checklist that can be used to op- 
timize loops with references to more detailed documentation. 
Table 3-1 describes the steps recommended for developing code to achieve
the highest performance on loops.


Table 3-1. Code Development Steps

Step   Description

1      Compile and profile native C code
       - Validates original C code
       - Determines which loops are most important in terms of MIPS
         requirements

2      Add const declarations and loop count information
       - Reduces potential pointer aliasing problems
       - Allows loops with indeterminate iteration counts to execute epilogs

3      Optimize C code using intrinsics and other methods
       - Facilitates use of certain ’C6x instructions not easily represented in C
       - Optimizes data flow bandwidth

4a     Write linear assembly
       - Allows control in determining exact ’C6x instructions to be used
       - Provides flexibility of hand-coded assembly without worry of
         pipelining, parallelism, or register allocation
       - Can pass memory bank information to the tools

4b     Add partitioning information to the linear assembly
       - Can improve partitioning of loops when necessary
       - Can avoid bottlenecks of certain hardware resources


When you achieve the desired performance in your code, there is no need to
move to the next step. Each step in the development flow involves passing
more information to the ’C6x tools. Even at the final step, development time
is greatly reduced from that of hand-coding, and the performance approaches
the best that can be achieved by hand.


Internal benchmarking efforts at Texas Instruments have shown that most
loops achieve maximal throughput after steps 1 and 2. For loops that do not,
the C compiler offers a rich set of optimizations that can fine-tune
performance entirely from the high-level C language. For the few loops that
need even further optimization, the assembly optimizer gives the programmer
more flexibility than C can offer, works within the framework of C, and is
much like programming in higher-level C. For more information on the assembly
optimizer, see the TMS320C6x Optimizing C Compiler User’s Guide and Chapter 6,
Optimizing Assembly Code via Linear Assembly, in this book. For example linear
assembly files, see the demo directory included with the ’C6x tools.


In order to aid the development process, a feedback option (—mw) is included 
in the code generation tools. Example 3-1 shows output from the compiler 
and/or assembly optimizer of a particular loop. See the TMS320C6x Optimiz- 
ing C Compiler User’s Guide for more information about the —mw option. 


Example 3-1. Compiler and/or Assembly Optimizer Feedback


SOFTWARE PIPELINE INFORMATION

   Loop label : LOOP
   Loop Carried Dependency Bound
   Unpartitioned Resource Bound
   Partitioned Resource Bound (*)
   Resource Partition:
                                A-side   B-side
   .L units
   .S units
   .D units
   .M units
   .X cross paths
   .T address paths
   Long read paths
   Long write paths
   Logical ops (.LS)
   Additional ops (.LSD)
   Bound (.L .S .LS)
   Bound (.L .S .D .LS .LSD)

   Searching for software pipeline schedule at
      ii = 4 Failed register allocation
      ii = 5 Schedule found with 4 iterations in parallel
   Done

[The numeric values in the bound and resource partition rows are illegible
in the source.]


This feedback is important in determining which optimizations might be useful 
for further improved performance. The following checklist is provided as a 
quick reference to techniques that can be used to optimize loops and refers 
to specific sections within this book for more detail. 
Table 3-2. TMS320C6x Optimization Checklist

Feedback: Loop carried dependency bound is much larger than unpartitioned
resource bound

   Solution (C code)

   - Use -pm program-level optimization to reduce memory pointer aliasing.
     See: Performing Program-Level Optimization (-pm Option), page 4-8

   - Add const declarations to all pointers passed to a function that are
     read only.
     See: The const Keyword, page 4-6

   - Use -mt option to assume no memory pointer aliasing.
     See: Memory Dependencies

   Solution (linear assembly)

   - Make sure instructions accessing memory at the beginning of the loop
     do not use the same pointer variables as instructions accessing memory
     at the end of the loop.
     See: Loop Carry Paths, page 6-73

Feedback: Partitioned resource bound is higher than unpartitioned resource
bound

   - Write code in linear assembly with partitioning/functional unit
     information.
     See: Linear Assembly Resource Allocation, page 6-19

Feedback: Too many instructions, or inefficient instructions were generated
by the compiler

   - Use intrinsics in C code to select more efficient ’C6x instructions.
     See: Using Intrinsics

   - Write code in linear assembly to pick exact ’C6x instruction to be
     executed.
     See: TMS320C6x Optimizing C Compiler User’s Guide

Feedback: Failed to software pipeline due to register live-too-long

   - Write linear assembly and insert MV instructions to split register
     lifetimes that are live-too-long.
     See: Split-Join-Path Problems, page 6-101

Feedback: Failed to software pipeline due to register allocation

   - Try splitting the loop into two separate loops.

   - If multiple conditionals are used in the loop, allocation of the
     condition registers could be the reason for the failure. Try writing
     linear assembly and partition all instructions writing to condition
     registers evenly between the A and B sides of the machine. If there
     are an uneven number, put more on the B side, since there are 3
     condition registers on the B side and only 2 on the A side.
Table 3-2. TMS320C6x Optimization Checklist (Continued)

Feedback: T address paths are resource bound

   Solution (C code)

   - Use word access for short arrays; declare int* and use mpy intrinsics
     to multiply upper and lower halves of registers.
     See: Using Word Access for Short Data in Part II, page 4-14

   - Try to employ redundant load elimination technique if possible.
     See: Redundant Load Elimination, page 6-106

   Solution (linear assembly)

   - Use LDW/STW instructions for accesses to memory.
     See: Using Word Access for Short Data in Part III, page 6-15

Feedback: There are memory bank conflicts (specified in the memory analysis
window of simulator)

   - Write linear assembly and use the .mptr directive.
     See: Loop Unrolling, page 6-90

Feedback: Larger outer loop overhead in nested loop

   - Unroll the inner loop.
     See: Loop Unrolling in Part II and Part III, pages 4-23 and 6-90

   - Make one loop with the outer loop instructions conditional on an inner
     loop counter.
     See: Outer Loop Conditionally Executed With Inner Loop, page 6-132

Feedback: Uneven resources (for example, 3 multiplies per loop iteration)

   - Unroll the loop to make an even number of resources.
     See: Loop Unrolling in Part III, page 6-90

Feedback: Two loops are generated, one not software pipelined

   - Use the _nassert statement to specify loop count information.
     See: Communicating Trip-Count Information to the Compiler, page 4-22

   - Use the .trip directive to specify loop count information.
     See: Lesson 5: Phase 3 of the Code Development Flow, page 2-25

Feedback: Loop will not software pipeline for other reasons

   - Make sure there are no function calls, branches to other code, or
     conditional break statements in the loop.
     See: What Disqualifies a Loop from Being Software-Pipelined, page 4-26

   - Try making the loop counter downcounting and declare it an int in C.
     See: Tips on Data Types and Trip Count Issues, pages 4-2 and 4-24

   - Remove any modifications to the loop counter inside the loop.
     See: What Disqualifies a Loop from Being Software-Pipelined, page 4-26
Part II


C Code
Chapter 4


Optimizing C Code


You can maximize C performance by using compiler options, intrinsics, and
code transformations. This chapter discusses the following topics:


- The compiler and its options
- Intrinsics
- Software pipelining
- Loop unrolling


Topic

4.1 Writing C Code
4.2 Compiling C Code
4.3 Refining C Code
4.1 Writing C Code


This chapter shows you how to analyze and tailor your code to be sure you are 
getting the best performance from the ’C6x architecture. 


4.1.1 Tips on Data Types


Give careful consideration to the data type size when writing your code. The
’C6x compiler defines a size for each data type (signed and unsigned):

- char      8 bits
- short    16 bits
- int      32 bits
- long     40 bits
- float    32 bits
- double   64 bits

Based on the size of each data type, follow these guidelines when writing C
code:

- Avoid code that assumes that int and long types are the same size,
  because the ’C6x compiler uses long values for 40-bit operations.

- Use the short data type for fixed-point multiplication inputs whenever
  possible because this data type provides the most efficient use of the
  16-bit multiplier in the ’C6x.

- Use int or unsigned int types for loop counters, rather than short or
  unsigned short data types, to avoid unnecessary sign-extension instructions.

- When using floating-point instructions on a floating-point device such as
  the ’C67x, use the -mv6700 compiler switch so the code generated will
  use the device’s floating-point hardware instead of performing the task
  with fixed point hardware.
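
As an illustration of the first three guidelines, here is a minimal sketch of a sum-of-products loop. The function name dot_product is hypothetical; the point is only the choice of types: short for the multiplier inputs, int for the loop counter and the accumulator.

```c
#include <assert.h>

/* Hypothetical example: sum of products over n elements.
   - short inputs match the 16-bit multiplier
   - int loop counter avoids sign-extending a short counter
   - int accumulator holds the 32-bit products */
static int dot_product(const short *a, const short *b, int n)
{
    int i;
    int sum = 0;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

If the accumulated sum could exceed 32 bits, the guideline about int and long applies: on the ’C6x a long accumulator is 40 bits wide, which is not the same size as on most host compilers.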


4.1.2 Analyzing C Code Performance 
Use the following techniques to analyze the performance of specific code
regions:

- One of the preliminary measures of code is the time it takes the code to
  run. Use the clock( ) and printf( ) functions in C to time and display the
  performance of specific code regions. You can use the stand-alone
  simulator (load6x) to run the code for this purpose.

- Use the profile mode in the debugger, as explained in the TMS320C6x
  C Source Debugger User’s Guide, to collect execution statistics about
  specific areas in your code.

- Use breakpoints, the clk register, and the RUNB command in the
  debugger, as described in the TMS320C6x C Source Debugger User’s
  Guide, to track the number of CPU clock cycles consumed by a particular
  section of code.

- The critical performance areas in your code are most often loops. The
  easiest way to optimize a loop is by extracting it into a separate file that
  can be rewritten, recompiled, and run stand-alone.
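
The first technique, timing a region with clock( ) and printf( ), might look like the following sketch. The functions work and time_region are hypothetical stand-ins for your own code region; the unit of a clock( ) tick depends on the run-time library and the target, so treat the printed number as a relative measure.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical code region to be timed. */
static long work(int n)
{
    long sum = 0;
    int i;

    for (i = 0; i < n; i++)
        sum += (long)i * i;
    return sum;
}

/* Runs work(n), prints the elapsed clock() ticks, and returns them. */
static long time_region(int n)
{
    clock_t start, stop;
    long result;

    assert(n >= 0);
    start = clock();
    result = work(n);
    stop = clock();

    printf("result = %ld, elapsed = %ld clock ticks\n",
           result, (long)(stop - start));
    return (long)(stop - start);
}
```

Printing the result of the timed region, as done here, also keeps an optimizing compiler from discarding the work being measured.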


As you use the techniques described in this chapter to optimize your C code, 
you can then evaluate the performance results by running the code and 
looking at the instructions generated by the compiler. 
4.2 Compiling C Code

The ’C6x compiler offers high-level language support by transforming your C
code into more efficient assembly language source code. The compiler tools
include a shell program (cl6x), which you use to compile, assembly optimize,
assemble, and link programs in a single step. To invoke the compiler shell,
enter:


cl6x [options] [filenames] [-z [linker options] [object files]]


For a complete description of the C compiler and the options discussed in this
chapter, see the TMS320C6x Optimizing C Compiler User’s Guide.


4.2.1 Compiler Options


Options control the operation of the compiler. Table 4-1 defines the options
discussed in this chapter.


Table 4-1. Subset of Compiler Options

Option      Description
-o†         Enables software pipelining and other optimizations in the
            compiler
-pm‡        Enables program-level optimization
-mt         Enables the compiler to use assumptions that allow it to be
            more aggressive with certain optimizations
-mg         Allows you to profile optimized code
-ms         Ensures that redundant loops are not generated, thereby
            reducing code size
-k          Keeps the assembly file so that you can inspect it
-mu         Disables software pipelining
-mh <n>     Allows speculative execution

† Although -o3 is preferable, at a minimum use the -o option.
‡ Use the -pm option for as much of your program as possible.
+ Use the —pm option for as much of your program as possible. 




4.2.2 Memory Dependencies 


To maximize the efficiency of your code, the ’C6x compiler schedules as many 
instructions as possible in parallel. To schedule instructions in parallel, the 
compiler must determine the relationships, or dependencies, between instruc- 
tions. Dependency means that one instruction must occur before another. 
Because only independent instructions can execute in parallel, dependencies 
inhibit parallelism. 


- If the compiler cannot determine that two instructions are independent (for
  example, b does not depend on a), it assumes a dependency and schedules
  the two instructions sequentially.

- If the compiler can determine that two instructions are independent of one
  another, it can schedule them in parallel.


Often it is difficult for the compiler to determine if instructions that access 
memory are independent. The following techniques help the compiler deter- 
mine which instructions are independent: 


- Use the const keyword to indicate which objects are not changed by a
  function.

- Use the -pm (program-level optimization) option, which gives the compiler
  global access to the whole program or module and allows it to be more
  aggressive in ruling out dependencies.

- Use the -mt option, which allows the compiler to use assumptions that
  allow it to eliminate dependencies.


To illustrate the concept of memory dependencies, it is helpful to look at the 
algorithm code in a dependency graph. Example 4—1 shows the C code for a 
basic vector sum. Figure 4—1 shows the dependency graph for this basic vec- 
tor sum. (For more information, see section 6.2.4, Drawing a Dependency 
Graph, on page 6-6.) 


Example 4—1. Basic Vector Sum 


void vecsum(short *sum, short *in1, short *in2, unsigned int N)
{
    int i;

    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}
Figure 4-1. Dependency Graph for Vector Sum #1
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The dependency graph in Figure 4—1 shows that: 


- The paths from sum[i] back to in1[i] and in2[i] indicate that writing to sum
  may have an effect on the memory pointed to by either in1 or in2.

- A read from in1 or in2 cannot begin until the write to sum finishes, which
  creates an aliasing problem. Aliasing occurs when two pointers can point
  to the same memory location. For example, if vecsum( ) is called in a
  program with the following statements, in1 and sum alias each other because
  they both point to the same memory location:


short a[10], b[10]; 
vecsum(a, a, b, 10); 
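
The effect of aliasing can be observed with any host C compiler. The following is a minimal sketch, not part of the 'C6x tools: vecsum_ref( ) repeats the loop from the basic vector sum, and overlap_demo( ) is a hypothetical helper that passes a sum pointer one element ahead of in1. Each store then changes a value that the next iteration loads, which is why the loads cannot be moved ahead of earlier stores unless the compiler knows that the pointers do not overlap.

```c
/* Same loop as the basic vector sum (renamed for this sketch). */
void vecsum_ref(short *sum, short *in1, short *in2, unsigned int N)
{
    unsigned int i;

    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}

/* Hypothetical helper: sum = &a[1] and in1 = &a[0], so the store in
   iteration i writes the element that iteration i + 1 loads.
   Returns the last element written. */
short overlap_demo(void)
{
    short a[5] = { 1, 0, 0, 0, 0 };
    short b[4] = { 1, 1, 1, 1 };

    vecsum_ref(&a[1], &a[0], b, 4);   /* a becomes {1, 2, 3, 4, 5} */
    return a[4];
}
```

If the four loads from a were all performed before the first store, a[4] would instead be 1, so reordering changes the result for this call.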


4.2.2.1. The const Keyword 
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In Figure 4—1, the reads from in1 and in2 finish before the write to sum within 
a single iteration. However, the ’C6x compiler uses software pipelining to exe- 
cute multiple iterations in parallel and, therefore, must determine memory 
dependencies that exist across loop iterations. 


To help the compiler, you can qualify an object with the const keyword, which 
indicates that a variable or the memory referenced by a variable will not be 
changed, but will remain constant. It is good coding practice to use the const 
keyword wherever you can, because it is a simple way to increase the perfor- 
mance and robustness of your code. 




Example 4-2 shows the vecsum( ) example rewritten with the const keyword
to indicate that the write to sum never changes the memory referenced by in1
and in2. Figure 4-2 shows the revised dependency graph for the code in the
inner loop.


Example 4-2. Vector Sum With const Keywords


void vecsum2(short *sum, const short *in1, const short *in2, unsigned int N)
{
    int i;

    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}


Figure 4-2. Dependency Graph for Vector Sum #2 
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Example 4-3 shows the output of the compiler for the vector sum in 
Example 4—2. The compiler finds better schedules when dependency paths 
are eliminated between instructions. For this loop, the compiler found a soft- 
ware pipeline with a 2-cycle kernel, compared with seven cycles for the 
previous loop. (The kernel is the body of a pipelined loop where all instructions 
execute in parallel.) 
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Example 4—3. Compiler Output for Vector Sum Code 


L14:        ; PIPE LOOP KERNEL
           ADD     .L1X    B4,A0,A5
|| [B0]    B       .S2     L14
||         LDH     .D1     *A3++,A0

           STH     .D1     A5,*A4++
|| [B0]    SUB     .L2     B0,1,B0
||         LDH     .D2     *B5++,B4


For basic information on assembly code, see Chapter 5, Structure of
Assembly Code.


4.2.2.2. Performing Program-Level Optimization (-pm Option) 


You can specify program-level optimization by using the -pm option with the
-o3 option. With program-level optimization, all your source files are compiled
into one intermediate file called a module. The module moves to the
optimization and code generation passes of the compiler. Because the compiler has
access to the entire program, it performs several optimizations that are rarely
applied during file-level optimization:

- If a particular argument in a function always has the same value, the
  compiler replaces the argument with the value and passes the value instead
  of the argument.

- If a return value of a function is never used, the compiler deletes the return
  code in the function.

- If a function is not called, directly or indirectly, the compiler removes the
  function.


4.2.2.3. The -mt Option


Another way to eliminate memory dependencies is to use the -mt option,
which allows the compiler to use assumptions that can eliminate memory
dependency paths. For example, if you use the -mt option when compiling the
code in Example 4-1, the compiler uses the assumption that in1 and in2
do not alias memory pointed to by sum and, therefore, eliminates memory
dependencies among the instructions that access those variables.


4.3 Refining C Code


You can realize substantial gains in the performance of your C code by
refining your code in the following areas:

- Using intrinsics to replace complicated C code

- Using word access to operate on 16-bit data stored in the high and low
  parts of a 32-bit register

- Using double access to operate on 32-bit data stored in a 64-bit register
  pair ('C67x only)

- Software pipelining the instructions manually


4.3.1 Using Intrinsics


The ’C6x compiler provides intrinsics, special functions that map directly to
inlined ’C62x/’C67x instructions, to optimize your C code quickly. All
instructions that are not easily expressed in C code are supported as intrinsics.
Intrinsics are specified with a leading underscore (_) and are accessed by calling
them as you call a function.


For example, saturated addition can be expressed in C code only by writing 
a multicycle function, such as the one in Example 4-4. 


Example 4-4. Saturated Add Without Intrinsics


int sadd(int a, int b)
{
    int result;

    result = a + b;

    if (((a ^ b) & 0x80000000) == 0)
    {
        if ((result ^ a) & 0x80000000)
        {
            result = (a < 0) ? 0x80000000 : 0x7fffffff;
        }
    }
    return (result);
}


This complicated code can be replaced by the _sadd( ) intrinsic, which results
in a single ’C6x instruction (see Example 4-5).
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Example 4—5. Saturated Add With Intrinsics 


result = _sadd(a,b);
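
When you test code that uses a saturating add on a host machine, a portable reference function can be useful. The following is a minimal sketch under stated assumptions: sadd_ref( ) is a hypothetical name, not a compiler intrinsic, and the sum is formed in unsigned arithmetic so that overflow is well defined in standard C.

```c
#include <stdint.h>

/* Portable reference for a 32-bit saturated add (hypothetical helper). */
int32_t sadd_ref(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;
    uint32_t ur = ua + ub;            /* wraps modulo 2^32, never undefined */

    /* Overflow is possible only when a and b have the same sign and the
       sign of the wrapped sum differs from it. */
    if ((((ua ^ ub) & 0x80000000u) == 0) && ((ur ^ ua) & 0x80000000u))
        return (a < 0) ? INT32_MIN : INT32_MAX;

    return a + b;                     /* no overflow on this path */
}
```

The test for overflow is the same sign comparison used in the saturated add without intrinsics; only the arithmetic type differs.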


Table 4-2 lists the ’C6x intrinsics. For more information on using intrinsics, see
the TMS320C6x Optimizing C Compiler User’s Guide.


Table 4-2. TMS320C6x C Compiler Intrinsics

                                              Assembly
C Compiler Intrinsic                          Instruction Device  Description

int _abs(int src2);                           ABS                 Returns the saturated absolute value of src2.
int _labs(long src2);

int _add2(int src1, int src2);                ADD2                Adds the upper and lower halves of src1 to the
                                                                  upper and lower halves of src2 and returns the
                                                                  result. Any overflow from the lower half add
                                                                  does not affect the upper half add.

uint _clr(uint src2, uint csta, uint cstb);   CLR                 Clears the specified field in src2. The
                                                                  beginning and ending bits of the field to be
                                                                  cleared are specified by csta and cstb,
                                                                  respectively.

unsigned _clrr(uint src1, int src2);          CLR                 Clears the specified field in src2. The
                                                                  beginning and ending bits of the field to be
                                                                  cleared are specified by the lower 10 bits of
                                                                  the source register.

int _dpint(double);                           DPINT       ’C67x   Converts 64-bit double to 32-bit signed
                                                                  integer, using the rounding mode set by the
                                                                  CSR register.

int _ext(uint src2, uint csta, int cstb);     EXT                 Extracts the specified field in src2,
                                                                  sign-extended to 32 bits. The extract is
                                                                  performed by a shift left followed by a signed
                                                                  shift right; csta and cstb are the shift left
                                                                  and shift right amounts, respectively.

int _extr(int src2, int src1);                EXT                 Extracts the specified field in src2,
                                                                  sign-extended to 32 bits. The extract is
                                                                  performed by a shift left followed by a signed
                                                                  shift right; the shift left and shift right
                                                                  amounts are specified by the lower 10 bits of
                                                                  src1.

uint _extu(uint src2, uint csta, uint cstb);  EXTU                Extracts the specified field in src2,
                                                                  zero-extended to 32 bits. The extract is
                                                                  performed by a shift left followed by an
                                                                  unsigned shift right; csta and cstb are the
                                                                  shift left and shift right amounts,
                                                                  respectively.

uint _extur(uint src2, int src1);             EXTU                Extracts the specified field in src2,
                                                                  zero-extended to 32 bits. The extract is
                                                                  performed by a shift left followed by an
                                                                  unsigned shift right; the shift left and shift
                                                                  right amounts are specified by the lower 10
                                                                  bits of src1.

uint _ftoi(float);                                        ’C67x   Reinterprets the bits in the float as an
                                                                  unsigned integer.
                                                                  (Ex: _ftoi(1.0) == 1065353216U)

uint _hi(double);                                         ’C67x   Returns the high 32 bits of a double as an
                                                                  integer.

double _itod(uint, uint);                                 ’C67x   Creates a new double register pair from two
                                                                  unsigned integers.

float _itof(uint);                                        ’C67x   Reinterprets the bits in the unsigned integer
                                                                  as a float.
                                                                  (Ex: _itof(0x3f800000) == 1.0)

uint _lmbd(uint src1, uint src2);             LMBD                Searches for a leftmost 1 or 0 of src2
                                                                  determined by the LSB of src1. Returns the
                                                                  number of bits up to the bit change.

uint _lo(double);                                         ’C67x   Returns the low (even) register of a double
                                                                  register pair as an integer.

int _mpy(int src1, int src2);                 MPY                 Multiplies the 16 LSBs of src1 by the 16 LSBs
int _mpyus(uint src1, int src2);              MPYUS               of src2 and returns the result. Values can be
int _mpysu(int src1, uint src2);              MPYSU               signed or unsigned.
uint _mpyu(uint src1, uint src2);             MPYU

int _mpyh(int src1, int src2);                MPYH                Multiplies the 16 MSBs of src1 by the 16 MSBs
int _mpyhus(uint src1, int src2);             MPYHUS              of src2 and returns the result. Values can be
int _mpyhsu(int src1, uint src2);             MPYHSU              signed or unsigned.
uint _mpyhu(uint src1, uint src2);            MPYHU

int _mpyhl(int src1, int src2);               MPYHL               Multiplies the 16 MSBs of src1 by the 16 LSBs
int _mpyhuls(uint src1, int src2);            MPYHULS             of src2 and returns the result. Values can be
int _mpyhslu(int src1, uint src2);            MPYHSLU             signed or unsigned.
uint _mpyhlu(uint src1, uint src2);           MPYHLU

int _mpylh(int src1, int src2);               MPYLH               Multiplies the 16 LSBs of src1 by the 16 MSBs
int _mpyluhs(uint src1, int src2);            MPYLUHS             of src2 and returns the result. Values can be
int _mpylshu(int src1, uint src2);            MPYLSHU             signed or unsigned.
uint _mpylhu(uint src1, uint src2);           MPYLHU

void _nassert(int);                                               Generates no code. Tells the optimizer that
                                                                  the expression declared with the assert
                                                                  function is true; this gives a hint to the
                                                                  optimizer as to what optimizations might be
                                                                  valid.

uint _norm(int src2);                         NORM                Returns the number of bits up to the first
uint _lnorm(long src2);                                           nonredundant sign bit of src2.

double _rcpdp(double);                        RCPDP       ’C67x   Computes the approximate 64-bit double
                                                                  reciprocal.

float _rcpsp(float);                          RCPSP       ’C67x   Computes the approximate 32-bit float
                                                                  reciprocal.

double _rsqrdp(double src);                   RSQRDP      ’C67x   Computes the approximate 64-bit double
                                                                  reciprocal square root.

float _rsqrsp(float src);                     RSQRSP      ’C67x   Computes the approximate 32-bit float
                                                                  reciprocal square root.

int _sadd(int src1, int src2);                SADD                Adds src1 to src2 and saturates the result.
long _lsadd(int src1, long src2);                                 Returns the result.

int _sat(long src2);                          SAT                 Converts a 40-bit value to a 32-bit value and
                                                                  saturates if necessary.

uint _set(uint src2, uint csta, uint cstb);   SET                 Sets the specified field in src2 to all 1s and
                                                                  returns the src2 value. The beginning and
                                                                  ending bits of the field to be set are
                                                                  specified by csta and cstb, respectively.

unsigned _setr(unsigned, int);                SET                 Sets the specified field in src2 to all 1s and
                                                                  returns the src2 value. The beginning and
                                                                  ending bits of the field to be set are
                                                                  specified by the lower ten bits of the source
                                                                  register.

int _smpy(int src1, int src2);                SMPY                Multiplies src1 by src2, left shifts the
int _smpyh(int src1, int src2);               SMPYH               result by one, and returns the result. If the
int _smpyhl(int src1, int src2);              SMPYHL              result is 0x80000000, saturates the result to
int _smpylh(int src1, int src2);              SMPYLH              0x7FFFFFFF.

int _spint(float);                            SPINT       ’C67x   Converts 32-bit float to 32-bit signed
                                                                  integer, using the rounding mode set by the
                                                                  CSR register.

uint _sshl(uint src2, uint src1);             SSHL                Shifts src2 left by the contents of src1,
                                                                  saturates the result to 32 bits, and returns
                                                                  the result.

int _ssub(int src1, int src2);                SSUB                Subtracts src2 from src1, saturates the
long _lssub(int src1, long src2);                                 result, and returns it.

uint _subc(uint src1, uint src2);             SUBC                Conditional subtract divide step.

int _sub2(int src1, int src2);                SUB2                Subtracts the upper and lower halves of src2
                                                                  from the upper and lower halves of src1, and
                                                                  returns the result. Any borrowing from the
                                                                  lower half subtract does not affect the upper
                                                                  half subtract.

Note: Instructions not specified with a device apply to all ’C6x devices.
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4.3.2 Using Word Access for Short Data 


The ’C6x has instructions with corresponding intrinsics, such as _add2( ),
_mpyhl( ), and _mpylh( ), that operate on 16-bit data stored in the high and low
parts of a 32-bit register. When operating on a stream of short data, you can
use word (int) accesses to read two short values at a time, and then use ’C6x
intrinsics to operate on the data. For example, rewriting the vecsum( ) function
to use word accesses (as in Example 4-6) doubles the performance of the
loop. See section 6.3, Loading Two Data Values with LDW, on page 6-15 for
more information.


Example 4-6. Vector Sum With const Keywords, _nassert, Word Reads


void vecsum4(short *sum, const short *in1, const short *in2, unsigned int N)
{
    int i;
    const int *i_in1 = (const int *)in1;
    const int *i_in2 = (const int *)in2;
    int       *i_sum = (int *)sum;

    _nassert(N >= 20);

    for (i = 0; i < (N/2); i++)
        i_sum[i] = _add2(i_in1[i], i_in2[i]);
}

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Note: 


The _nassert intrinsic tells the optimizer that the code that follows meets the 
condition specified. 
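
To check word-access code on a host machine, where the intrinsics are not available, the packed add can be modeled in portable C. The following is a minimal sketch based only on the ADD2 description in Table 4-2; add2_ref( ) is a hypothetical name, and the model is not guaranteed to match the instruction in every detail.

```c
#include <stdint.h>

/* Host-side model of the packed 16-bit add: the two halves are added
   independently, and a carry out of the lower half does not reach the
   upper half. */
uint32_t add2_ref(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;            /* lower halves */
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;    /* upper halves */

    return hi | lo;
}
```

For example, adding 0x0001FFFF and 0x00000001 with this model gives 0x00010000: the lower half wraps to zero and the upper half is unchanged.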




This transformation assumes that the pointers sum, in1, and in2 can be cast 
to int *, which means that they must point to word-aligned data. By default, the 
compiler aligns all short arrays on word boundaries; however, a call like the 
following creates an illegal memory access: 


short a[51], b[50], c[50];
vecsum4(&a[1], b, c, 50);


Another consideration is that the loop must now run for an even number of 
iterations. You can ensure that this happens by padding the short arrays so 
that the loop always operates on an even number of elements. 
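
Both conditions can be tested before a word-access routine is called. The following is a minimal sketch, assuming 4-byte words and a host compiler that provides uintptr_t; the helper names is_word_aligned( ) and can_use_word_access( ) are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: nonzero when p lies on a 4-byte (word) boundary. */
int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 0x3u) == 0;
}

/* Hypothetical helper: nonzero when the word-access loop may be used,
   that is, all three pointers are word aligned and N is even. */
int can_use_word_access(const short *sum, const short *in1,
                        const short *in2, unsigned int N)
{
    return is_word_aligned(sum) && is_word_aligned(in1) &&
           is_word_aligned(in2) && (N & 0x1u) == 0;
}
```

A caller can use the result to choose between the word-access function and the basic short-access loop.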




If a vecsum( ) function is needed to handle short-aligned data and odd
loop counts, then you must add code within the function to check for
these cases. Knowing what type of data is passed to a function can improve
performance considerably. It may be useful to write different functions that can
handle different types of data. If your short-data operations always operate on
word-aligned arrays with an even number of elements, then the performance of
your application can be improved. However, Example 4-7 provides a generic
vecsum( ) function that handles all types of data.


Example 4-7. Vector Sum With const Keywords, _nassert, Word Reads (Generic Version)


void vecsum5(short *sum, const short *in1, const short *in2, unsigned int N)
{
    int i;

    _nassert(N >= 20);

    /* Any pointer that is only short-aligned forces short accesses. */
    if (((int)sum | (int)in2 | (int)in1) & 0x2)
    {
        for (i = 0; i < N; i++)
            sum[i] = in1[i] + in2[i];
    }
    else
    {
        const int *i_in1 = (const int *)in1;
        const int *i_in2 = (const int *)in2;
        int       *i_sum = (int *)sum;

        for (i = 0; i < (N/2); i++)
            i_sum[i] = _add2(i_in1[i], i_in2[i]);

        /* For odd N, the last short is not covered by the word loop. */
        if (N & 0x1) sum[N-1] = in1[N-1] + in2[N-1];
    }
}


4.3.2.1 Using Word Access in Dot Product 


Other intrinsics that are useful for reading short data as words are the multiply
intrinsics. Example 4-8 is a dot product example that reads word-aligned short
data and uses the _mpy( ) and _mpyh( ) intrinsics. The _mpyh( ) intrinsic uses
the ’C6x instruction MPYH, which multiplies the high 16 bits of two registers,
giving a 32-bit result.


This example also uses two sum variables (sum1 and sum2). Using only one
sum variable inhibits parallelism by creating a dependency between the write
from the first sum calculation and the read in the second sum calculation.
Within a small loop body, avoid writing to the same variable, because it inhibits
parallelism and creates dependencies.
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Example 4—8. Dot Product Using Intrinsics 


int dotprod(const short *a, const short *b, unsigned int N)
{
    int i, sum1 = 0, sum2 = 0;
    const int *i_a = (const int *)a;
    const int *i_b = (const int *)b;

    for (i = 0; i < (N >> 1); i++)
    {
        sum1 = sum1 + _mpy (i_a[i], i_b[i]);
        sum2 = sum2 + _mpyh(i_a[i], i_b[i]);
    }
    return sum1 + sum2;
}
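
The two-accumulator structure can be checked on a host machine by modeling the two multiplies in portable C. The following is a minimal sketch based on the MPY and MPYH descriptions in Table 4-2; the names mpy_ref( ), mpyh_ref( ), and dotprod_packed( ) are hypothetical, and the conversion to int16_t assumes a two's-complement host.

```c
#include <stdint.h>

/* Signed 16 x 16 -> 32-bit product of the low halves of two packed words. */
int32_t mpy_ref(uint32_t a, uint32_t b)
{
    return (int32_t)(int16_t)(a & 0xFFFFu) * (int16_t)(b & 0xFFFFu);
}

/* Signed 16 x 16 -> 32-bit product of the high halves of two packed words. */
int32_t mpyh_ref(uint32_t a, uint32_t b)
{
    return (int32_t)(int16_t)(a >> 16) * (int16_t)(b >> 16);
}

/* Dot product over nwords packed words (two shorts per word), with one
   accumulator for each half so that the two sums are independent. */
int32_t dotprod_packed(const uint32_t *a, const uint32_t *b,
                       unsigned int nwords)
{
    int32_t sum1 = 0, sum2 = 0;
    unsigned int i;

    for (i = 0; i < nwords; i++) {
        sum1 += mpy_ref(a[i], b[i]);
        sum2 += mpyh_ref(a[i], b[i]);
    }
    return sum1 + sum2;
}
```

Because sum1 and sum2 are never read or written by the same statement, neither accumulation has to wait for the other.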


4.3.2.2 Using Word Access in FIR Filter 


Example 4-9 shows an FIR filter that can be optimized with word reads of short
data and multiply intrinsics.


Example 4-9. FIR Filter - Original Form


void fir1(const short x[], const short h[], short y[], int n, int m, int s)
{
    int i, j;
    long y0;
    long round = 1L << (s - 1);

    for (j = 0; j < m; j++)
    {
        y0 = round;

        for (i = 0; i < n; i++)
            y0 += x[i + j] * h[i];

        y[j] = (int) (y0 >> s);
    }
}
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Example 4-10 shows an optimized version of Example 4-9. The optimized 
version passes an int array instead of casting the short arrays to int arrays and, 
therefore, helps ensure that data passed to the function is word-aligned. As- 
suming that a prototype is used, each invocation of the function ensures that 
the input arrays are word-aligned by forcing you to insert a cast or by using int 
arrays that contain short data. 




Example 4-10. FIR Filter - Optimized Form


void fir2(const int x[], const int h[], short y[], int n, int m, int s)
{
    int i, j;
    long y0, y1;
    long round = 1L << (s - 1);

    _nassert(m >= 16);
    _nassert(n >= 16);

    for (j = 0; j < (m >> 1); j++)
    {
        /* Each pass produces two outputs; both start from the rounding
           constant, as y0 does in Example 4-9. */
        y0 = y1 = round;

        for (i = 0; i < (n >> 1); i++)
        {
            y0 += _mpy  (x[i + j],     h[i]);
            y0 += _mpyh (x[i + j],     h[i]);
            y1 += _mpyhl(x[i + j],     h[i]);
            y1 += _mpylh(x[i + j + 1], h[i]);
        }

        *y++ = (int) (y0 >> s);
        *y++ = (int) (y1 >> s);
    }
}


4.3.2.3. Using Double Word Access for Word Data (’C67x Specific) 


The ’C67x architecture has a load double word (LDDW) instruction, which can
read 64 bits of data into a register pair. Just like using word accesses to read
2 short data items, double word accesses can be used to read 2 word data items
(or 4 short data items). When operating on a stream of float data, you can use
double accesses to read 2 float values at a time, and then use intrinsics to
operate on the data.


The basic float dot product is shown in Example 4-11. Since the float addition
(ADDSP) instruction takes 4 cycles to complete, the minimum kernel size for
this loop is 4 cycles. For this version of the loop, a result is completed every
4 cycles.
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Example 4—11. Basic Float Dot Product 


float dotp1(const float a[], const float b[])
{
    int i;
    float sum = 0;

    for (i=0; i<512; i++)
        sum += a[i] * b[i];

    return sum;
}


In Example 4-12, the dot product example is rewritten to use double word
loads, and intrinsics are used to extract the high and low 32-bit values contained
in the 64-bit double. The _hi() and _lo() intrinsics return integer values; the
_itof() intrinsic subverts the C typing system by interpreting an integer value
as a float value. In this version of the loop, 2 float results are computed every
4 cycles.


Example 4-12. Float Dot Product Using Intrinsics


float dotp2(const double a[], const double b[])
{
    int i;
    float sum0 = 0;
    float sum1 = 0;

    /* i counts floats; each double element holds two of them, so the
       double index is i >> 1. */
    for (i=0; i<512; i+=2)
    {
        sum0 += _itof(_hi(a[i>>1])) * _itof(_hi(b[i>>1]));
        sum1 += _itof(_lo(a[i>>1])) * _itof(_lo(b[i>>1]));
    }

    return sum0 + sum1;
}
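
The bit-level behavior of these intrinsics can be modeled on a host machine. The following is a minimal sketch under stated assumptions: float is a 32-bit IEEE 754 type, a 64-bit register pair is represented by one uint64_t, and all function names are hypothetical. memcpy( ) copies the bits without converting the value, which is the reinterpretation that the intrinsic descriptions in Table 4-2 call for.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret 32 bits as a float, and a float as 32 bits. */
float itof_ref(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

uint32_t ftoi_ref(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Upper and lower 32 bits of a modeled 64-bit register pair. */
uint32_t hi_ref(uint64_t pair) { return (uint32_t)(pair >> 32); }
uint32_t lo_ref(uint64_t pair) { return (uint32_t)pair; }

/* Multiply-accumulate the two floats packed in each 64-bit element,
   with one accumulator for each half. */
float dotp_packed(const uint64_t *a, const uint64_t *b, unsigned int n)
{
    float sum0 = 0, sum1 = 0;
    unsigned int i;

    for (i = 0; i < n; i++) {
        sum0 += itof_ref(hi_ref(a[i])) * itof_ref(hi_ref(b[i]));
        sum1 += itof_ref(lo_ref(a[i])) * itof_ref(lo_ref(b[i]));
    }
    return sum0 + sum1;
}
```

The two conversion helpers reproduce the examples given in Table 4-2: the bit pattern 0x3f800000 corresponds to 1.0, and 1.0 corresponds to 1065353216.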


In Example 4-13, the dot product example is unrolled to maximize perfor- 
mance. The preprocessor is used to define convenient macros FHI() and 
FLO() for accessing the high and low 32-bit values in a double word. In this 
version of the loop, 8 float values are computed every 4 cycles. 
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Example 4—13. Float Dot Product With Peak Performance 




#define FHI(a) _itof(_hi(a))
#define FLO(a) _itof(_lo(a))

float dotp3(const double a[], const double b[])
{
    int i;
    float sum0 = 0;
    float sum1 = 0;
    float sum2 = 0;
    float sum3 = 0;
    float sum4 = 0;
    float sum5 = 0;
    float sum6 = 0;
    float sum7 = 0;

    for (i=0; i<512; i+=4)
    {
        sum0 += FHI(a[i])   * FHI(b[i]);
        sum1 += FLO(a[i])   * FLO(b[i]);
        sum2 += FHI(a[i+1]) * FHI(b[i+1]);
        sum3 += FLO(a[i+1]) * FLO(b[i+1]);
        sum4 += FHI(a[i+2]) * FHI(b[i+2]);
        sum5 += FLO(a[i+2]) * FLO(b[i+2]);
        sum6 += FHI(a[i+3]) * FHI(b[i+3]);
        sum7 += FLO(a[i+3]) * FLO(b[i+3]);
    }

    sum0 += sum1;
    sum2 += sum3;
    sum4 += sum5;
    sum6 += sum7;
    sum0 += sum2;
    sum4 += sum6;

    return sum0 + sum4;
}




4.3.3 Software Pipelining 


Software pipelining is a technique used to schedule instructions from a loop
so that multiple iterations of the loop execute in parallel. When you use the -o2
and -o3 compiler options, the compiler attempts to software pipeline your
code with information that it gathers from your program.


Figure 4-3 illustrates a software-pipelined loop. The stages of the loop are
represented by A, B, C, D, and E. In this figure, a maximum of five iterations
of the loop can execute at one time. The shaded area represents the loop
kernel. In the loop kernel, all five stages execute in parallel. The area immediately
before the kernel is known as the pipelined-loop prolog, and the area
immediately following the kernel is known as the pipelined-loop epilog.


Figure 4-3. Software-Pipelined Loop
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Pipelined-loop prolog 


Kernel 


Pipelined-loop epilog 


Because loops present critical performance areas in your code, consider the
following areas to improve the performance of your C code:

- Trip count
- Redundant loops
- Loop unrolling
- Speculative execution
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4.3.3.1 Trip Count Issues 


A trip count is the number of times that a loop executes; the trip counter is the 
variable used to count each iteration. When the trip counter reaches a limit 
equal to the trip count, the loop terminates. The structure of a software pipeline 
requires the execution of a minimum number of loop iterations (a minimum trip 
count) in order to fill, or prime, the pipeline. 


Loops that are eligible for software pipelining have loop trip counters that count 
down. In most cases, the compiler can transform the loop to use a trip counter 
that counts down even if the original code was not written that way. 


For example, the optimizer transforms the loop in Example 4-14(a) to something
like the code in Example 4-14(b).


Example 4-14. Trip Counters

(a) Original code

for (i = 0; i < N; i++)      /* i = trip counter, N = trip count */

(b) Optimized code

for (i = N; i != 0; i--)     /* Downcounting trip counter */


The minimum trip count for a software-pipelined loop is the smallest number
of iterations that the loop must execute in order to fill the pipeline.


If the compiler knows the trip count, it can generate faster and more compact 
code. If the compiler cannot determine that a loop always executes for the 
minimum trip count, it generates a redundant unpipelined loop. The redundant 
unpipelined loop is executed only when the runtime trip count is less than the 
minimum trip count; otherwise, the software-pipelined version of the loop is 
executed. 
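
Conceptually, the generated code selects between the two versions at run time. The following is a minimal sketch of that selection written as ordinary C for a host compiler; it does not show actual compiler output. MIN_TRIP and all function names are hypothetical, and the real minimum trip count depends on the schedule that the compiler finds for the loop.

```c
/* Hypothetical minimum trip count for this sketch. */
#define MIN_TRIP 3u

/* Stands in for the software-pipelined version; here it is only a
   down-counting loop that is entered when N >= MIN_TRIP. */
static void vecsum_pipelined(short *sum, const short *in1,
                             const short *in2, unsigned int N)
{
    unsigned int i;

    for (i = N; i != 0; i--)
        sum[N - i] = in1[N - i] + in2[N - i];
}

/* Stands in for the redundant unpipelined loop. */
static void vecsum_unpipelined(short *sum, const short *in1,
                               const short *in2, unsigned int N)
{
    unsigned int i;

    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}

/* Run-time selection between the two versions. Returns 1 when the
   pipelined stand-in ran and 0 otherwise. */
int vecsum_dispatch(short *sum, const short *in1,
                    const short *in2, unsigned int N)
{
    if (N >= MIN_TRIP) {
        vecsum_pipelined(sum, in1, in2, N);
        return 1;
    }
    vecsum_unpipelined(sum, in1, in2, N);
    return 0;
}
```

When the compiler can prove that N is never below the minimum, the test and the second loop are unnecessary, which is the source of the code size savings.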
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4.3.3.2 Eliminating Redundant Loops 


In Example 4-2 on page 4-7, the compiler cannot determine if the loop
always executes more than the minimum trip count. Therefore, it generates
two versions of the loop:

- An unpipelined version that executes if N is less than the minimum trip
  count

- A software-pipelined version that executes if N is equal to or greater than
  the minimum trip count


To indicate to the compiler that you do not want two versions of the loop, you
can use the -ms option so that the compiler generates only the
software-pipelined code and never generates a redundant loop; however, loops with an
unknown trip count are not software pipelined.


4.3.3.3. Communicating Trip-Count Information to the Compiler 


When invoking the compiler, use the following options to communicate
trip-count information to the compiler:

- Use the -o3 and -pm compiler options to allow the optimizer to access the
  whole program or large parts of it and to characterize the behavior of loop
  trip counts.

- Use the _nassert intrinsic to help reduce code size by preventing the
  generation of a redundant loop or by allowing the compiler (with or without
  the -ms option) to software pipeline innermost loops.


Example 4-15 shows the vector sum code with an _nassert intrinsic that
asserts that N is always at least 10.


Example 4-15. Vector Sum With const Keywords and _nassert


void vecsum3(short *sum, const short *in1, const short *in2, unsigned int N)
{
    int i;

    _nassert(N >= 10);

    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}


See the TMS320C6x Optimizing C Compiler User’s Guide for a complete
discussion of the -ms, -o3, and -pm options and the _nassert intrinsic.
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4.3.3.4 Loop Unrolling 


Another technique that improves performance is unrolling the loop; that is, ex- 
panding small loops so that each iteration of the loop appears in your code. 
This optimization increases the number of instructions available to execute in 
parallel. You can use loop unrolling when the operations in a single iteration 
do not use all of the resources of the ’C6x architecture. 


In Example 4-16, the loop produces a new sum[i] every two cycles. Three
memory operations are performed: a load for both in1[i] and in2[i] and a store
for sum[i]. Because only two memory operations can execute per cycle, two
cycles are necessary to perform three memory operations.


Example 4-16. Vector Sum With Three Memory Operations


void vecsum2(short *sum, const short *in1, const short *in2, unsigned int N)
{
    int i;

    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}


The performance of a software pipeline is limited by the number of resources
that can execute in parallel. In its word-aligned form (Example 4-17), the
vector sum loop delivers two results every two cycles because the two loads and
the store are all operating on two 16-bit values at a time.


Example 4-17. Word-Aligned Vector Sum


void vecsum4(short *sum, const short *in1, const short *in2, unsigned int N)
{
    int i;
    const int *i_in1 = (const int *)in1;
    const int *i_in2 = (const int *)in2;
    int       *i_sum = (int *)sum;

    _nassert(N >= 20);

    for (i = 0; i < (N/2); i++)
        i_sum[i] = _add2(i_in1[i], i_in2[i]);
}

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If you unroll the loop once, the loop then performs six memory operations per 
iteration, which means the unrolled vector sum loop can deliver four results 
every three cycles (that is, 1.33 results per cycle). Example 4—18 shows four 
results for each iteration of the loop: sum[i] and sum[i+sz] each store an int 
value that represents two 16-bit values. 
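
The cycle counts quoted for these loops follow from the limit of two memory operations per cycle. The following is a minimal sketch of that arithmetic; the names MEM_OPS_PER_CYCLE and min_cycles_for_mem_ops( ) are hypothetical, and the bound covers memory operations only, not other resources.

```c
/* Two memory operations can execute per cycle. */
#define MEM_OPS_PER_CYCLE 2u

/* Lower bound on kernel cycles imposed by memory operations alone:
   the number of operations divided by two, rounded up. */
unsigned int min_cycles_for_mem_ops(unsigned int mem_ops)
{
    return (mem_ops + MEM_OPS_PER_CYCLE - 1u) / MEM_OPS_PER_CYCLE;
}
```

Three memory operations need at least two cycles, so one of the four available slots is unused. Six memory operations need three cycles and fill every slot, which is how the unrolled loop reaches four results in three cycles.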


Example 4-18 is not simple loop unrolling where the loop body is simply
replicated. The additional instructions use memory pointers that are offset to point
midway into the input arrays, and the code assumes that the arrays are
a multiple of four shorts in size.


Example 4-18. Vector Sum Using const Keywords, _nassert, Word Reads, and
Loop Unrolling


void vecsum6(int *sum, const int *in1, const int *in2, unsigned int N)
{
    int i;
    int sz = N >> 2;

    _nassert(N >= 20);

    for (i = 0; i < sz; i++)
    {
        sum[i]    = _add2(in1[i], in2[i]);
        sum[i+sz] = _add2(in1[i+sz], in2[i+sz]);
    }
}
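
The midpoint indexing can be verified on a host machine with a portable model of the packed add. The following is a minimal sketch: add2_model( ) and vecsum6_model( ) are hypothetical names, the packed add follows the ADD2 description in Table 4-2, and N is assumed to be a multiple of 4.

```c
#include <stdint.h>

/* Host-side model of the packed 16-bit add (hypothetical helper). */
static uint32_t add2_model(uint32_t a, uint32_t b)
{
    return (((a >> 16) + (b >> 16)) << 16) | ((a + b) & 0xFFFFu);
}

/* Same structure as the unrolled vector sum, with N counted in shorts:
   the first statement covers words 0 to sz-1 and the second covers
   words sz to 2*sz-1. Assumes N is a multiple of 4. */
void vecsum6_model(uint32_t *sum, const uint32_t *in1,
                   const uint32_t *in2, unsigned int N)
{
    unsigned int i;
    unsigned int sz = N >> 2;

    for (i = 0; i < sz; i++) {
        sum[i]      = add2_model(in1[i],      in2[i]);
        sum[i + sz] = add2_model(in1[i + sz], in2[i + sz]);
    }
}
```

With N equal to 8, sz is 2, and the two statements together write words 0 through 3, so every element is produced exactly once.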


Software pipelining is performed by the compiler only on inner loops; there- 
fore, you can increase performance by creating larger inner loops. One 
method for creating large inner loops is to completely unroll inner loops that 
execute for a small number of cycles. 


In Example 4-19, the compiler pipelines the inner loop with a kernel size of one
cycle; therefore, the inner loop completes a result every cycle. However, the
overhead of filling and draining the software pipeline can be significant, and
other outer-loop code is not software pipelined.
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Example 4-19. FIR_Type2— Original Form 


void fir2(const short input[], const short coefs[], short out[])
{
    int i, j;
    int sum;

    for (i = 0; i < 40; i++)
    {
        sum = 0;    /* restart the accumulator for each output sample */

        for (j = 0; j < 16; j++)
            sum += coefs[j] * input[i + 15 - j];

        out[i] = (sum >> 15);
    }
}


For loops with a simple loop structure, the compiler uses a heuristic to deter- 
mine if it should unroll the loop. Because unrolling can increase code size, in 
some cases the compiler does not unroll the loop. If you have identified this 
loop as being critical to your application, then unroll the inner loop in C code, 
as in Example 4-20. 
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Example 4-20. FIR_Type2—Inner Loop Completely Unrolled 


void fir2_u(const short input[], const short coefs[], short out[])
{
    int i;
    int sum;

    for (i = 0; i < 40; i++)
    {
        sum  = coefs[0]  * input[i + 15];
        sum += coefs[1]  * input[i + 14];
        sum += coefs[2]  * input[i + 13];
        sum += coefs[3]  * input[i + 12];
        sum += coefs[4]  * input[i + 11];
        sum += coefs[5]  * input[i + 10];
        sum += coefs[6]  * input[i + 9];
        sum += coefs[7]  * input[i + 8];
        sum += coefs[8]  * input[i + 7];
        sum += coefs[9]  * input[i + 6];
        sum += coefs[10] * input[i + 5];
        sum += coefs[11] * input[i + 4];
        sum += coefs[12] * input[i + 3];
        sum += coefs[13] * input[i + 2];
        sum += coefs[14] * input[i + 1];
        sum += coefs[15] * input[i + 0];
        out[i] = (sum >> 15);
    }
}


Now the outer loop is software-pipelined, and the overhead of draining and 
filling the software pipeline occurs only once per invocation of the function 
rather than for each iteration of the outer loop. 
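
When you unroll a loop by hand, it is worth confirming that the rolled and unrolled forms give the same results. The following is a minimal sketch that uses a shorter filter to keep the listing small; NTAPS, NOUT, fir_rolled( ), and fir_unrolled( ) are hypothetical, and the input array is assumed to hold NOUT + NTAPS - 1 elements.

```c
#define NTAPS 4   /* number of coefficients in this sketch */
#define NOUT  8   /* number of output samples in this sketch */

/* Rolled form: the inner loop runs NTAPS times for each output. */
void fir_rolled(const short input[], const short coefs[], short out[])
{
    int i, j, sum;

    for (i = 0; i < NOUT; i++) {
        sum = 0;
        for (j = 0; j < NTAPS; j++)
            sum += coefs[j] * input[i + NTAPS - 1 - j];
        out[i] = (short)(sum >> 15);
    }
}

/* Inner loop completely unrolled: only one loop remains. */
void fir_unrolled(const short input[], const short coefs[], short out[])
{
    int i, sum;

    for (i = 0; i < NOUT; i++) {
        sum  = coefs[0] * input[i + 3];
        sum += coefs[1] * input[i + 2];
        sum += coefs[2] * input[i + 1];
        sum += coefs[3] * input[i + 0];
        out[i] = (short)(sum >> 15);
    }
}
```

Running both functions on the same input and comparing the output arrays element by element is a simple check that no term was dropped or misindexed during unrolling.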


4.3.3.5 Speculative Execution (-mh Option)


The -mh option eliminates the epilog for a software pipelined loop, which can
result in significant code size savings. Software pipelined loop epilogs can
often be eliminated if load instructions can be speculatively executed. An
instruction is speculatively executed if it is executed before it is known whether
the result of the instruction is needed. Allowing speculative execution of load
instructions may result in a read past the beginning or end of a buffer. For a
complete discussion on the -mh option, see the TMS320C6x Optimizing C
Compiler User’s Guide.


4.3.3.6 What Disqualifies a Loop from Being Software-Pipelined 


In a sequence of nested loops, the innermost loop is the only one that can be 
software-pipelined. The following restrictions apply to the software pipelining 
of loops: 
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Although a software-pipelined loop can contain intrinsics, it cannot contain 
function calls. 


You must not have a conditional break (early exit) in the loop. 


The loop cannot have an incrementing loop counter. One reason that you
run the optimizer with the -o2 or -o3 option is to convert as many loops
as possible into downcounting loops.


If the trip counter is modified within the body of the loop, it typically cannot
be converted into a downcounting loop. For example, the following code
is not software-pipelined:


for (i = 0; i < n; i++)
{
    i += x;
}


A conditionally incremented loop control variable is not software-pipe- 
lined. For example, the following code is not software-pipelined: 


for (i = 0; i < x; i++) 


If the code size is too large and requires more than the 32 registers in the 
’C6x, it is not software-pipelined. 


If a register value is live too long, the code is not software-pipelined. See 
section 6.5.6.2, Live Too Long, on page 6-63 and section 6.9, Live-Too- 
Long Issues, on page 6-97 for examples of code that is live too long. 


If the loop has complex condition code within the body that requires more 
than the five ’C6x condition registers, the loop is not software pipelined. 
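

As a hedged illustration of the loop-counter restrictions above, the following
sketch writes the same summation two ways. The function names are
hypothetical and are not taken from this manual. The second form decrements
its trip counter exactly once per iteration and never modifies it elsewhere,
which is the shape the restrictions ask for. Whether a particular compiler
version actually software-pipelines either loop is not guaranteed by this
sketch.

```c
/* Hypothetical helper names; a sketch, not compiler output. */

/* Up-counting form: the optimizer must first convert this into a
   downcounting loop before it can be software-pipelined. */
int sum_upcount(const short *x, int n)
{
    int i, sum = 0;

    for (i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Explicit downcounting form: the trip counter is decremented once per
   iteration and is not modified anywhere else in the body. */
int sum_downcount(const short *x, int n)
{
    int sum = 0;
    int cnt;

    for (cnt = n; cnt > 0; cnt--)
        sum += *x++;
    return sum;
}
```

Both functions return the same value for the same input; only the way the
trip count is expressed differs.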
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Chapter 5 


Structure of Assembly Code 


An assembly language program must be an ASCII text file. Any line of 
assembly code can include up to seven items: 


- Label
- Parallel bars
- Conditions
- Instruction
- Functional unit
- Operands
- Comment
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5.1 Labels 


A label identifies a line of code or a variable and represents a memory address 
that contains either an instruction or data. 


Figure 5—1 shows the position of the label in a line of assembly code. The colon 
following the label is optional. 


Figure 5-1. Labels in Assembly Code


label: parallel bars [condition] instruction unit operands ; comments


Labels must meet the following conditions:


- The first character of a label must be a letter or an underscore (_) followed
  by a letter.

- The first character of the label must be in the first column of the text file.

- Labels can include up to 32 alphanumeric characters.


5.2 Parallel Bars


An instruction that executes in parallel with the previous instruction signifies
this with parallel bars (||). This field is left blank for an instruction that does not
execute in parallel with the previous instruction.


Figure 5—2. Parallel Bars in Assembly Code 


label: parallel bars [condition] instruction unit operands ; comments


5.3 Conditions 


Five registers in the ’C6x are available for conditions: A1, A2, B0, B1, and B2.
Figure 5-3 shows the position of a condition in a line of assembly code.


Figure 5-3. Conditions in Assembly Code


label: parallel bars [condition] instruction unit operands ; comments


All ’C6x instructions are conditional:


- If no condition is specified, the instruction is always performed.

- If a condition is specified and that condition is true, the instruction
  executes. For example:

  With this condition...    The instruction executes if...
  [A1]                      A1 != 0
  [!A1]                     A1 = 0

- If a condition is specified and that condition is false, the instruction does
  not execute.

  With this condition...    The instruction does not execute if...
  [A1]                      A1 = 0
  [!A1]                     A1 != 0
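

The behavior of a condition field can be modeled in C as a guarded
assignment. The sketch below is only an illustration: the function names are
hypothetical, and the register move stands in for any conditional instruction
(for example, MV guarded by [A1] or [!A1]).

```c
/* Models "[A1] MV src,dst": the move happens only when A1 is nonzero.
   The return value is the resulting contents of dst. */
int cond_move(int a1, int src, int dst)
{
    if (a1 != 0)
        dst = src;
    return dst;
}

/* Models "[!A1] MV src,dst": the move happens only when A1 is zero. */
int cond_move_not(int a1, int src, int dst)
{
    if (a1 == 0)
        dst = src;
    return dst;
}
```

When the condition is false, the destination keeps its previous value, which
is what "the instruction does not execute" means in practice.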
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5.4 Instructions 


Assembly code instructions are either directives or mnemonics: 


- Assembler directives are commands for the assembler (asm6x) that
  control the assembly process or define the data structures (constants and
  variables) in the assembly language program. All assembler directives
  begin with a period, as shown in the partial list in Table 5-1.

- Processor mnemonics are the actual microprocessor instructions that
  execute at runtime and perform the operations in the program. Table 5-2
  summarizes the ’C6x mnemonics. Processor mnemonics must begin in
  column 2 or greater.


Figure 5—4 shows the position of the instruction in a line of assembly code. 


Figure 5—4. Instructions in Assembly Code 


label: parallel bars [condition] instruction unit operands ; comments 


Table 5-1. Selected TMS320C6x Directives


Directives        Description

.sect "name"      Creates section of information (data or code)

.double value     Reserve two consecutive 32 bits (64 bits) in memory and
                  fill with double-precision (64-bit) IEEE floating-point
                  representation of specified value

.float value      Reserve 32 bits in memory and fill with single-precision
                  (32-bit) IEEE floating-point representation of specified
                  value

.int value        Reserve 32 bits in memory and fill with specified value
.long value
.word value

.short value      Reserve 16 bits in memory and fill with specified value
.half value

.byte value       Reserve 8 bits in memory and fill with specified value


See the TMS320C6x Assembly Language Tools User’s Guide for a complete
list of directives.
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Table 5-2. Selected TMS320C6x Instruction Mnemonics 


Arithmetic:       ABS, ADD, ADDA, ADDK, ADDDP†, ADDSP†, ADD2, DPINT†,
                  DPSP†, DPTRUNC†, INTDP†, INTSP†, RCPDP†, RCPSP†,
                  RSQRDP†, RSQRSP†, SADD, SAT, SPDP†, SPINT†, SPTRUNC†,
                  SSUB, SUB, SUBA, SUBC, SUBDP†, SUBSP†, SUB2

Multiply:         MPY, MPYDP†, MPYH, MPYHL, MPYI†, MPYID†, MPYLH,
                  MPYSP†, SMPY

Load/Store:       LD, LDDW†, MVK, MVKH, ST

Program Control:  B, B IRP, B NRP

Bit Management:   CLR, EXT, LMBD, NORM, SET

Logical:          AND, CMPEQ, CMPEQDP†, CMPEQSP†, CMPGT, CMPGTDP†,
                  CMPGTSP†, CMPLT, CMPLTDP†, CMPLTSP†, OR, SHL, SHR,
                  SSHL, XOR

Pseudo/Other:     IDLE, MV, MVC, NOP, ZERO, NEG, NOT


† ’C67x instruction mnemonics only
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See the TMS320C62x/C67x CPU and Instruction Set Reference Guide for a 


complete list of instructions. 
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5.5 Functional Units 


The ’C6x CPU contains eight functional units, which are shown in Figure 5-5 
and described in Table 5-3. 


Figure 5—5. TMS320C6x Functional Units 
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Table 5—3. Functional Units and Descriptions 


Functional Unit       Description

.L unit (.L1, .L2)    32/40-bit arithmetic and compare operations
                      Leftmost 1, 0, bit counting for 32 bits
                      Normalization count for 32 and 40 bits
                      32-bit logical operations
                      32/64-bit IEEE floating-point arithmetic†
                      Floating-point/fixed-point conversions†

.S unit (.S1, .S2)    32-bit arithmetic operations
                      32/40-bit shifts and 32-bit bit-field operations
                      32-bit logical operations
                      Branching
                      Constant generation
                      Register transfers to/from the control register file
                      32/64-bit IEEE floating-point compare operations†
                      32/64-bit IEEE floating-point reciprocal and square root
                      reciprocal approximation†

.M unit (.M1, .M2)    16 x 16-bit multiplies
                      32 x 32-bit multiplies†
                      Single-precision (32-bit) floating-point IEEE multiplies†
                      Double-precision (64-bit) floating-point IEEE multiplies†

.D unit (.D1, .D2)    32-bit add, subtract, linear and circular address
                      calculation


† ’C67x floating-point devices only
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Figure 5-6 shows the position of the unit in a line of assembly code. 


Figure 5—6. Units in the Assembly Code 


label: parallel bars [condition] instruction unit operands ; comments


Specifying the functional unit in the assembly code is optional. The functional 
unit can be used to document which resource(s) each instruction uses. 
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5.6 Operands 


The ’C6x architecture requires that memory reads and writes move data 
between memory and a register. Figure 5—7 shows the position of the oper- 
ands in a line of assembly code. 


Figure 5—7. Operands in the Assembly Code 


label: parallel bars [condition] instruction unit operands  ; comments 


Instructions have the following requirements for operands in the assembly 
code: 


- All instructions require a destination operand.

- Most instructions require one or two source operands.

- The destination operand must be in the same register file as one source
  operand.

- One source operand from each register file per execute packet can come
  from the register file opposite that of the other source operand.


When an operand comes from the other register file, the unit includes an X,
as shown in Figure 5-8, indicating that the instruction is using one of the
cross paths. (See the TMS320C6x CPU and Instruction Set Reference
Guide for more information on cross paths.)


Figure 5—8. Operands in Instructions 


ADD   .L1    A0,A1,A3


ADD   .L1X   A0,B1,A3


All registers except B1 are on the same side of the CPU.


The ’C6x instructions use three types of operands to access data:


- Register operands indicate a register that contains the data.

- Constant operands specify the data within the assembly code.

- Pointer operands contain addresses of data values.


Only the load and store instructions require and use pointer operands to
move data values between memory and a register.


5.7 Comments


As with all programming languages, comments provide code documentation. 
Figure 5-9 shows the position of the comment in a line of assembly code. 


Figure 5-9. Comments in Assembly Code 


label: parallel bars [condition] instruction unit operands ; comments


The following are guidelines for using comments in assembly code:


- A comment may begin in any column when preceded by a semicolon (;).
- A comment must begin in the first column when preceded by an asterisk (*).
- Comments are not required but are recommended.
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Optimizing Assembly Code 
via Linear Assembly 


This chapter describes methods that help you develop more efficient 
assembly language programs, understand the code produced by the 
assembly optimizer, and perform manual optimization. 


This chapter encompasses phase 3 of the code development flow. After you 
have developed and optimized your C code using the 'C6x compiler, extract 
the inefficient areas from your C code and rewrite them in linear assembly (as- 
sembly code that has not been register-allocated and is unscheduled). 


The assembly code shown in this chapter has been hand-optimized in order 
to direct your attention to particular coding issues. The actual output from the 
assembly optimizer may look different, depending on the version you are us- 
ing. 


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6.1 Assembly Code 



The source that you write for the assembly optimizer is similar to assembly 
source code; however, linear assembly does not include information about 
parallel instructions, instruction latencies, or register usage. The assembly op- 
timizer takes care of the difficulties of streamlining your code by: 


- Finding instructions that can be executed in parallel
- Handling pipeline latencies during software pipelining
- Assigning register usage
- Defining which unit to use


Although you have the option with the ’C6x to specify the functional unit or reg- 
ister used, this may restrict the compiler’s ability to fully optimize your code. 
See the TMS320C6x Optimizing C Compiler User's Guide for more informa- 
tion. 


This chapter takes you through the optimization process manually to show you 
how the assembly optimizer works and to help you understand when you might 
want to perform some of the optimizations manually. Each section introduces 
optimization techniques in increasing complexity: 


- Section 6.2 and section 6.3 begin with a dot product algorithm to show you
  how to translate the C code to assembly code and then how to optimize
  the linear assembly code with several simple techniques.

- Section 6.4 and section 6.5 introduce techniques for the more complex
  algorithms associated with software pipelining, such as modulo iteration
  interval scheduling for both single-cycle loops and multicycle loops.

- Section 6.6 uses an IIR filter algorithm to discuss the problems with loop
  carry paths.

- Section 6.7 and section 6.8 discuss the problems encountered with
  if-then-else statements in a loop and how loop unrolling can be used to
  resolve them.

- Section 6.9 introduces live-too-long issues in your code.

- Section 6.10 uses a simple FIR filter algorithm to discuss redundant load
  elimination.

- Section 6.11 discusses the same FIR filter in terms of the interleaved
  memory bank scheme used by ’C6x devices.

- Section 6.12 and section 6.13 show you how to execute the outer loop of
  the FIR filter conditionally and in parallel with the inner loop.


Each example discusses the:


- Algorithm in C code

- Translation of the C code to linear assembly

- Dependency graph to describe the flow of data in the algorithm

- Allocation of resources (functional units, registers, and cross paths) in
  linear assembly


Note: 


There are three types of code for the ’C6x: C code (which is input for the C 
compiler), linear assembly code (which is input for the assembly optimizer), 
and assembly code (which is input for the assembler). 


In the next three sections, we use the dot product to demonstrate how to use 
various programming techniques to optimize both performance and code size. 
Most of the examples provided in this book use fixed-point arithmetic; howev- 
er, the next three sections give both fixed-point and floating-point examples of 
the dot product to show that the same optimization techniques apply to both 
fixed- and floating-point programs. 
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6.2 Writing Parallel Code 


One way to optimize linear assembly code is to reduce the number of execu- 
tion cycles in a loop. You can do this by rewriting linear assembly instructions 
so that the final assembly instructions execute in parallel. 


6.2.1 Dot Product C Code 


The dot product is a sum in which each element in array a is multiplied by the
corresponding element in array b. Each of these products is then accumulated
into sum. The C code in Example 6-1 is a fixed-point dot product algorithm.
The C code in Example 6-2 is a floating-point dot product algorithm.


Example 6-1. Fixed-Point Dot Product C Code 


int dotp(short a[], short b[])
{
    int sum, i;
    sum = 0;

    for (i=0; i<100; i++)
        sum += a[i] * b[i];

    return(sum);
}


Example 6-2. Floating-Point Dot Product C Code 


float dotp(float a[], float b[])
{
    int i;
    float sum;
    sum = 0;
    for (i=0; i<100; i++)
        sum += a[i] * b[i];

    return(sum);
}
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6.2.2 Translating C Code to Linear Assembly 


The first step in optimizing your code is to translate the C code to linear assem- 
bly. 


6.2.2.1 Fixed-Point Dot Product 


Example 6-3 shows the linear assembly instructions used for the inner loop 
of the fixed-point dot product C code. 


Example 6-3. List of Assembly Instructions for Fixed-Point Dot Product


        LDH    .D1   *A4++,A2    ; load ai from memory
        LDH    .D1   *A3++,A5    ; load bi from memory
        MPY    .M1   A2,A5,A6    ; ai * bi
        ADD    .L1   A6,A7,A7    ; sum += (ai * bi)
        SUB    .S1   A1,1,A1     ; decrement loop counter
[A1]    B      .S2   LOOP        ; branch to loop


The load halfword (LDH) instructions increment through the a and b arrays. 
Each LDH does a postincrement on the pointer. Each iteration of these instruc- 
tions sets the pointer to the next halfword (16 bits) in the array. The ADD in- 
struction accumulates the total of the results from the multiply (MPY) instruc- 
tion. The subtract (SUB) instruction decrements the loop counter. 


An additional instruction is included to execute the branch back to the top of 
the loop. The branch (B) instruction is conditional on the loop counter, A1, and 
executes only until A1 is 0. 
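

To make the correspondence between Example 6-3 and the C source explicit,
the following sketch models the loop in C, one statement per instruction. It
is an illustration only: the function name and the local variable names are
assumptions, and the code says nothing about timing or delay slots. Because
the branch condition is tested after the body, the model uses a do-while loop
and assumes a loop count of at least 1.

```c
#include <assert.h>

/* A C model of the loop in Example 6-3 (names are illustrative). */
int dotp_model(const short *a, const short *b, int cntr)
{
    int sum = 0;               /* A7: accumulator, cleared before the loop */

    assert(cntr > 0);          /* the loop body always runs at least once */
    do {
        short ai = *a++;       /* LDH *A4++,A2 : load, then postincrement */
        short bi = *b++;       /* LDH *A3++,A5 */
        int   pi = ai * bi;    /* MPY A2,A5,A6 */
        sum += pi;             /* ADD A6,A7,A7 */
        cntr -= 1;             /* SUB A1,1,A1  */
    } while (cntr != 0);       /* [A1] B LOOP : branch while A1 is nonzero */

    return sum;
}
```

The branch is taken as long as the counter is nonzero, so the loop exits on
the iteration in which the counter reaches 0.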


6.2.2.2 Floating-Point Dot Product 


Example 6-4 shows the linear assembly instructions used for the inner loop 
of the floating-point dot product C code. 


Example 6-4. List of Assembly Instructions for Floating-Point Dot Product


        LDW     .D1   *A4++,A2    ; load ai from memory
        LDW     .D1   *A3++,A5    ; load bi from memory
        MPYSP†  .M1   A2,A5,A6    ; ai * bi
        ADDSP†  .L1   A6,A7,A7    ; sum += (ai * bi)
        SUB     .S1   A1,1,A1     ; decrement loop counter
[A1]    B       .S2   LOOP        ; branch to loop


† ADDSP and MPYSP are ’C67x (floating-point) instructions only.


The load word (LDW) instructions increment through the a and b arrays. Each
LDW does a postincrement on the pointer. Each iteration of these instructions
sets the pointer to the next word (32 bits) in the array. The ADDSP instruction
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accumulates the total of the results from the multiply (MPYSP) instruction. The 
subtract (SUB) instruction decrements the loop counter. 


An additional instruction is included to execute the branch back to the top of 
the loop. The branch (B) instruction is conditional on the loop counter, A1, and 
executes only until A1 is 0. 


6.2.3 Linear Assembly Resource Allocation 


The following rules affect the assignment of functional units for Example 6-3 
and Example 6—4 (shown in the third column of each example): 


- Load (LDH and LDW) instructions must use a .D unit.
- Multiply (MPY and MPYSP) instructions must use a .M unit.
- Add (ADD and ADDSP) instructions use a .L unit.
- Subtract (SUB) instructions use a .S unit.
- Branch (B) instructions must use a .S unit.


Note:


The ADD and SUB can be on the .S, .L, or .D units; however, for Example 6-3
and Example 6-4, they are assigned as listed above.


The ADDSP instruction in Example 6-4 must use a .L unit.


6.2.4 Drawing a Dependency Graph 



Dependency graphs can help analyze loops by showing the flow of instruc- 
tions and data in an algorithm. These graphs also show how instructions 
depend on one another. The following terms are used in defining a depen- 
dency graph. 


- A node is a point on a dependency graph with one or more data paths
  flowing in and/or out.

- The path shows the flow of data between nodes. The numbers beside
  each path represent the number of cycles required to complete the
  instruction.

- An instruction that writes to a variable is referred to as a parent instruction
  and defines a parent node.

- An instruction that reads a variable written by a parent instruction is
  referred to as its child and defines a child node.
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Use the following steps to draw a dependency graph: 


1. Define the nodes based on the variables accessed by the instructions.
2. Define the data paths that show the flow of data between nodes.
3. Add the instructions and the latencies.
4. Add the functional units.


6.2.4.1. Fixed-Point Dot Product 


Figure 6-1 shows the dependency graph for the fixed-point dot product 
assembly instructions shown in Example 6—3 and their corresponding register 
allocations. 


Figure 6-1. Dependency Graph of Fixed-Point Dot Product


- The two LDH instructions, which write the values of ai and bi, are parents
  of the MPY instruction. It takes five cycles for the parent (LDH) instruction
  to complete. Therefore, if LDH is scheduled on cycle i, then its child (MPY)
  cannot be scheduled until cycle i + 5.

- The MPY instruction, which writes the product pi, is the parent of the ADD
  instruction. The MPY instruction takes two cycles to complete.

- The ADD instruction adds pi (the result of the MPY) to sum. The output of
  the ADD instruction feeds back to become an input on the next iteration
  and, thus, creates a loop carry path. (See section 6.6 on page 6-74 for
  more information on loop carry paths.)


The dependency graph for this dot product algorithm has two separate parts
because the decrement of the loop counter and the branch do not read or write
any variables from the other part.


- The SUB instruction writes to the loop counter, cntr. The output of the SUB
  instruction feeds back and creates a loop carry path.

- The branch (B) instruction is a child of the loop counter.
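

The scheduling rule stated for the fixed-point graph (a child cannot be
scheduled until its parent has completed) reduces to simple addition along a
path. The sketch below is a minimal illustration of that arithmetic using the
latencies quoted above (five cycles for LDH, two for MPY); the function names
are hypothetical.

```c
/* Earliest cycle on which a child can be scheduled, given the cycle of
   its parent and the number of cycles the parent takes to complete. */
int earliest_child_cycle(int parent_cycle, int parent_latency)
{
    return parent_cycle + parent_latency;
}

/* Earliest cycle for the ADD of one iteration when both loads issue on
   cycle ldh_cycle: LDH (5 cycles) -> MPY (2 cycles) -> ADD. */
int earliest_add_cycle(int ldh_cycle)
{
    int mpy_cycle = earliest_child_cycle(ldh_cycle, 5);

    return earliest_child_cycle(mpy_cycle, 2);
}
```

With the loads on cycle i, the MPY can be scheduled no earlier than cycle
i + 5 and the ADD no earlier than cycle i + 7.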


6.2.4.2 Floating-Point Dot Product 


Similarly, Figure 6-2 shows the dependency graph for the floating-point dot 
product assembly instructions shown in Example 6—4 and their corresponding 
register allocations. 


Figure 6-2. Dependency Graph of Floating-Point Dot Product


- The two LDW instructions, which write the values of ai and bi, are parents
  of the MPYSP instruction. It takes five cycles for the parent (LDW)
  instruction to complete. Therefore, if LDW is scheduled on cycle i, then its
  child (MPYSP) cannot be scheduled until cycle i + 5.

- The MPYSP instruction, which writes the product pi, is the parent of the
  ADDSP instruction. The MPYSP instruction takes four cycles to complete.

- The ADDSP instruction adds pi (the result of the MPYSP) to sum. The
  output of the ADDSP instruction feeds back to become an input on the next
  iteration and, thus, creates a loop carry path. (See section 6.6 on page
  6-74 for more information on loop carry paths.)


The dependency graph for this dot product algorithm has two separate parts
because the decrement of the loop counter and the branch do not read or write
any variables from the other part.


- The SUB instruction writes to the loop counter, cntr. The output of the SUB
  instruction feeds back and creates a loop carry path.

- The branch (B) instruction is a child of the loop counter.
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6.2.5 Nonparallel Versus Parallel Assembly Code 


6.2.5.1. Fixed-Point Dot Product 


Nonparallel assembly code is performed serially, that is, one instruction
following another in sequence. This section explains how to rewrite the
instructions so that they execute in parallel.


Example 6-5 shows the nonparallel assembly code for the fixed-point dot
product loop. The MVK instruction initializes the loop counter to 100. The
ZERO instruction clears the accumulator. The NOP instructions allow for the
delay slots of the LDH, MPY, and B instructions.


Executing this dot product code serially requires 16 cycles for each iteration
plus two cycles to set up the loop counter and initialize the accumulator;
100 iterations require 1602 cycles.


Example 6-5. Nonparallel Assembly Code for Fixed-Point Dot Product


        MVK    .S1   100, A1     ; set up loop counter
        ZERO   .L1   A7          ; zero out accumulator
LOOP:
        LDH    .D1   *A4++,A2    ; load ai from memory
        LDH    .D1   *A3++,A5    ; load bi from memory
        NOP    4                 ; delay slots for LDH
        MPY    .M1   A2,A5,A6    ; ai * bi
        NOP                      ; delay slot for MPY
        ADD    .L1   A6,A7,A7    ; sum += (ai * bi)
        SUB    .S1   A1,1,A1     ; decrement loop counter
[A1]    B      .S2   LOOP        ; branch to loop
        NOP    5                 ; delay slots for branch
                                 ; Branch occurs here


Assigning the same functional unit to both LDH instructions slows
performance of this loop. Therefore, reassign the functional units to execute the
code in parallel, as shown in the dependency graph in Figure 6-3. The parallel
assembly code is shown in Example 6-6.


Figure 6-3. Dependency Graph of Fixed-Point Dot Product with Parallel Assembly


Example 6-6. Parallel Assembly Code for Fixed-Point Dot Product


        MVK    .S1   100, A1     ; set up loop counter
||      ZERO   .L1   A7          ; zero out accumulator
LOOP:
        LDH    .D1   *A4++,A2    ; load ai from memory
||      LDH    .D2   *B4++,B2    ; load bi from memory
        SUB    .S1   A1,1,A1     ; decrement loop counter
[A1]    B      .S2   LOOP        ; branch to loop
        NOP    2                 ; delay slots for LDH
        MPY    .M1X  A2,B2,A6    ; ai * bi
        NOP                      ; delay slots for MPY
        ADD    .L1   A6,A7,A7    ; sum += (ai * bi)
                                 ; Branch occurs here


Because the loads of ai and bi do not depend on one another, both LDH 
instructions can execute in parallel as long as they do not share the same 
resources. To schedule the load instructions in parallel, allocate the functional 
units as follows: 


- ai and the pointer to ai to a functional unit on the A side, .D1
- bi and the pointer to bi to a functional unit on the B side, .D2


Because the MPY instruction now has one source operand from A and one 
from B, MPY uses the 1X cross path. 
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Rearranging the order of the instructions also improves the performance of the 
code. The SUB instruction can take the place of one of the NOP delay slots 
for the LDH instructions. Moving the B instruction after the SUB removes the 
need for the NOP 5 used at the end of the code in Example 6-5. 


The branch now occurs immediately after the ADD instruction so that the MPY 
and ADD execute in parallel with the five delay slots required by the branch 
instruction. 


6.2.5.2 Floating-Point Dot Product 


Similarly, Example 6-7 shows the nonparallel assembly code for the
floating-point dot product loop. The MVK instruction initializes the loop counter to 100.
The ZERO instruction clears the accumulator. The NOP instructions allow for
the delay slots of the LDW, ADDSP, MPYSP, and B instructions.


Executing this dot product code serially requires 21 cycles for each iteration 
plus two cycles to set up the loop counter and initialize the accumulator; 100 it- 
erations require 2102 cycles. 


Example 6-7. Nonparallel Assembly Code for Floating-Point Dot Product


        MVK    .S1   100, A1     ; set up loop counter
        ZERO   .L1   A7          ; zero out accumulator
LOOP:
        LDW    .D1   *A4++,A2    ; load ai from memory
        LDW    .D1   *A3++,A5    ; load bi from memory
        NOP    4                 ; delay slots for LDW
        MPYSP  .M1   A2,A5,A6    ; ai * bi
        NOP    3                 ; delay slots for MPYSP
        ADDSP  .L1   A6,A7,A7    ; sum += (ai * bi)
        NOP    3                 ; delay slots for ADDSP
        SUB    .S1   A1,1,A1     ; decrement loop counter
[A1]    B      .S2   LOOP        ; branch to loop
        NOP    5                 ; delay slots for branch
                                 ; Branch occurs here


Assigning the same functional unit to both LDW instructions slows perfor- 
mance of this loop. Therefore, reassign the functional units to execute the 
code in parallel, as shown in the dependency graph in Figure 6—4. The parallel 
assembly code is shown in Example 6-8. 


Figure 6-4. Dependency Graph of Floating-Point Dot Product with Parallel Assembly

Example 6-8. Parallel Assembly Code for Floating-Point Dot Product


        MVK    .S1   100, A1     ; set up loop counter
||      ZERO   .L1   A7          ; zero out accumulator
LOOP:
        LDW    .D1   *A4++,A2    ; load ai from memory
||      LDW    .D2   *B4++,B2    ; load bi from memory
        SUB    .S1   A1,1,A1     ; decrement loop counter
        NOP    2                 ; delay slots for LDW
[A1]    B      .S2   LOOP        ; branch to loop
        MPYSP  .M1X  A2,B2,A6    ; ai * bi
        NOP    3                 ; delay slots for MPYSP
        ADDSP  .L1   A6,A7,A7    ; sum += (ai * bi)
                                 ; Branch occurs here


Because the loads of ai and bi do not depend on one another, both LDW 
instructions can execute in parallel as long as they do not share the same 
resources. To schedule the load instructions in parallel, allocate the functional 
units as follows: 


- ai and the pointer to ai to a functional unit on the A side, .D1
- bi and the pointer to bi to a functional unit on the B side, .D2


Because the MPYSP instruction now has one source operand from A and one 
from B, MPYSP uses the 1X cross path. 
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Rearranging the order of the instructions also improves the performance of the 
code. The SUB instruction replaces one of the NOP delay slots for the LDW 
instructions. Moving the B instruction after the SUB removes the need for the 
NOP 5 used at the end of the code in Example 6-7 on page 6-12. 


The branch now occurs immediately after the ADDSP instruction so that the 
MPYSP and ADDSP execute in parallel with the five delay slots required by 
the branch instruction. 


Since the ADDSP finishes execution before the result is needed, the NOP 3
for delay slots is removed, further reducing cycle count.


6.2.6 Comparing Performance


Executing the fixed-point dot product code in Example 6—6 requires eight 
cycles for each iteration plus one cycle to set up the loop counter and initialize 
the accumulator; 100 iterations require 801 cycles. 


Table 6—1 compares the performance of the nonparallel code with the parallel 
code for the fixed-point example. 


Table 6-1. Comparison of Nonparallel and Parallel Assembly Code for Fixed-Point
Dot Product


Code Example                                            100 Iterations   Cycle Count

Example 6-5  Fixed-point dot product nonparallel        2 + 100 x 16     1602
             assembly
Example 6-6  Fixed-point dot product parallel assembly  1 + 100 x 8      801


Executing the floating-point dot product code in Example 6-8 requires ten 
cycles for each iteration plus one cycle to set up the loop counter and initialize 
the accumulator; 100 iterations require 1001 cycles. 
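

The cycle counts quoted in this section all follow the same arithmetic: the
setup cycles plus the number of iterations times the cycles per iteration. As
a small sketch (the function name is hypothetical), that calculation can be
written as:

```c
/* Total cycles = setup cycles + iterations * cycles per iteration. */
int total_cycles(int setup, int iterations, int cycles_per_iteration)
{
    return setup + iterations * cycles_per_iteration;
}
```

For example, total_cycles(2, 100, 16) reproduces the 1602 cycles of the
nonparallel fixed-point loop, and total_cycles(1, 100, 10) reproduces the
1001 cycles of the parallel floating-point loop.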


Table 6—2 compares the performance of the nonparallel code with the parallel 
code for the floating-point example. 


Table 6-2. Comparison of Nonparallel and Parallel Assembly Code for Floating-Point
Dot Product


Code Example                                            100 Iterations   Cycle Count

Example 6-7  Floating-point dot product nonparallel     2 + 100 x 21     2102
             assembly
Example 6-8  Floating-point dot product parallel        1 + 100 x 10     1001
             assembly


6.3 Using Word Access for Short Data and Doubleword Access for
    Floating-Point Data


The parallel code for the fixed-point example in section 6.2 uses an LDH
instruction to read a[i]. Because a[i] and a[i+1] are next to each other in
memory, you can optimize the code further by using the load word (LDW)
instruction to read a[i] and a[i+1] at the same time and load both into a single
32-bit register. (The data must be word-aligned in memory.)


In the floating-point example, the parallel code uses an LDW instruction to read
a[i]. Because a[i] and a[i+1] are next to each other in memory, you can
optimize the code further by using the load doubleword (LDDW) instruction to read
a[i] and a[i+1] at the same time and load both into a register pair. (The data
must be doubleword-aligned in memory.) See the TMS320C62x/C67x CPU
and Instruction Set Reference Guide for more specific information on the LDDW
instruction.


Note:


The load doubleword (LDDW) instruction is only available on the ’C67x
(floating-point) device.


6.3.1 Unrolled Dot Product C Code


The fixed-point C code in Example 6-9 has the effect of unrolling the loop by
accumulating the even elements, a[i] and b[i], into sum0 and the odd elements,
a[i+1] and b[i+1], into sum1. After the loop, sum0 and sum1 are added to
produce the final sum. The same is true for the floating-point C code in
Example 6-10. (For another example of loop unrolling, see section 6.8 on
page 6-91.)


Example 6-9. Fixed-Point Dot Product C Code (Unrolled) 


int dotp(short a[], short b[])
{
    int sum0, sum1, sum, i;

    sum0 = 0;
    sum1 = 0;
    for (i=0; i<100; i+=2) {
        sum0 += a[i] * b[i];
        sum1 += a[i + 1] * b[i + 1];
    }
    sum = sum0 + sum1;
    return(sum);
}


Example 6-10. Floating-Point Dot Product C Code (Unrolled)


float dotp(float a[], float b[])
{
    int i;
    float sum0, sum1, sum;

    sum0 = 0;
    sum1 = 0;
    for (i=0; i<100; i+=2) {
        sum0 += a[i] * b[i];
        sum1 += a[i + 1] * b[i + 1];
    }
    sum = sum0 + sum1;
    return(sum);
}


6.3.2 Translating C Code to Linear Assembly 


The first step in optimizing your code is to translate the C code to linear
assembly.


6.3.2.1 Fixed-Point Dot Product 


Example 6-11 shows the list of ’C6x instructions that execute the unrolled 
fixed-point dot product loop. Symbolic variable names are used instead of ac- 
tual registers. Using symbolic names for data and pointers makes code easier 
to write and allows the optimizer to allocate registers. However, you must use 
the .reg assembly optimizer directive. See the TMS320C6x Optimizing C 
Compiler User’s Guide for more information on writing linear assembly code. 


Example 6-11. Linear Assembly for Fixed-Point Dot Product Inner Loop with LDW


        LDW    *a++,ai_i1        ; load ai & a1 from memory
        LDW    *b++,bi_i1        ; load bi & b1 from memory
        MPY    ai_i1,bi_i1,pi    ; ai * bi
        MPYH   ai_i1,bi_i1,pi1   ; ai+1 * bi+1
        ADD    pi,sum0,sum0      ; sum0 += (ai * bi)
        ADD    pi1,sum1,sum1     ; sum1 += (ai+1 * bi+1)
[cntr]  SUB    cntr,1,cntr       ; decrement loop counter
[cntr]  B      LOOP              ; branch to loop


The two load word (LDW) instructions load a[i], a[i+1], b[i], and b[i+1] on each
iteration.

Two MPY instructions are now necessary to multiply the second set of array
elements:


- The first MPY instruction multiplies the 16 least significant bits (LSBs) in
  each source register: a[i] x b[i].

- The MPYH instruction multiplies the 16 most significant bits (MSBs) of
  each source register: a[i+1] x b[i+1].


The two ADD instructions accumulate the sums of the even and odd elements:
sum0 and sum1.


Note:


This is true only when the ’C6x is in little-endian mode. In big-endian mode,
MPY operates on a[i+1] and b[i+1] and MPYH operates on a[i] and b[i]. See
the TMS320C62x/C67x Peripherals Reference Guide for more information.
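The pairing of one word load with MPY and MPYH can be modeled in portable C. The following is a minimal sketch, assuming little-endian packing (a[i] in the 16 LSBs, a[i+1] in the 16 MSBs). The helper names sext16, load_word_le, mpy, mpyh, and dotp_packed are hypothetical; they are not ’C6x intrinsics and only imitate the behavior described above.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of v to a 32-bit integer. */
static int32_t sext16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Model of a little-endian word load of two adjacent shorts (assumption):
 * p[0] lands in the 16 LSBs and p[1] in the 16 MSBs. */
static uint32_t load_word_le(const int16_t *p)
{
    return (uint32_t)(uint16_t)p[0] | ((uint32_t)(uint16_t)p[1] << 16);
}

/* MPY: product of the signed 16 LSBs of each source. */
static int32_t mpy(uint32_t x, uint32_t y)
{
    return sext16(x) * sext16(y);
}

/* MPYH: product of the signed 16 MSBs of each source. */
static int32_t mpyh(uint32_t x, uint32_t y)
{
    return sext16(x >> 16) * sext16(y >> 16);
}

/* Dot product over n elements (n even), two elements per iteration. */
static int32_t dotp_packed(const int16_t *a, const int16_t *b, int n)
{
    int32_t sum0 = 0, sum1 = 0;
    int i;

    assert(n >= 0 && n % 2 == 0);
    for (i = 0; i < n; i += 2) {
        uint32_t ai_i1 = load_word_le(&a[i]);   /* a[i] and a[i+1] */
        uint32_t bi_i1 = load_word_le(&b[i]);   /* b[i] and b[i+1] */
        sum0 += mpy(ai_i1, bi_i1);              /* a[i]   * b[i]   */
        sum1 += mpyh(ai_i1, bi_i1);             /* a[i+1] * b[i+1] */
    }
    return sum0 + sum1;
}
```

Swapping the roles of mpy and mpyh in the loop body would model the big-endian case.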


6.3.2.2 Floating-Point Dot Product 


Example 6-12 shows the list of ’C6x instructions that execute the unrolled 
floating-point dot product loop. Symbolic variable names are used instead of 
actual registers. Using symbolic names for data and pointers makes code eas- 
ier to write and allows the optimizer to allocate registers. However, you must 
use the .reg assembly optimizer directive. See the TMS320C6x Optimizing C 
Compiler User’s Guide for more information on writing linear assembly code. 


Example 6-12. Linear Assembly for Floating-Point Dot Product Inner Loop with LDDW


        LDDW    *a++,ai1:ai0        ; load a[i+0] & a[i+1] from memory
        LDDW    *b++,bi1:bi0        ; load b[i+0] & b[i+1] from memory
        MPYSP   ai0,bi0,pi0         ; a[i+0] * b[i+0]
        MPYSP   ai1,bi1,pi1         ; a[i+1] * b[i+1]
        ADDSP   pi0,sum0,sum0       ; sum0 += (a[i+0] * b[i+0])
        ADDSP   pi1,sum1,sum1       ; sum1 += (a[i+1] * b[i+1])
[cntr]  SUB     cntr,1,cntr         ; decrement loop counter
[cntr]  B       LOOP                ; branch to loop


The two load doubleword (LDDW) instructions load a[i], a[i+1], b[i], and b[i+1]
on each iteration.


Two MPYSP instructions are now necessary to multiply the second set of array 
elements. 


The two ADDSP instructions accumulate the sums of the even and odd 
elements: sum0 and sum1. 

6.3.3 Drawing a Dependency Graph


The dependency graph in Figure 6—5 for the fixed-point dot product shows that 
the LDW instructions are parents of the MPY instructions and the MPY instruc- 
tions are parents of the ADD instructions. To split the graph between the A and 
B register files, place an equal number of LDWs, MPYs, and ADDs on each 
side. To keep both sides even, place the remaining two instructions, B and 
SUB, on opposite sides. 


Figure 6-5. Dependency Graph of Fixed-Point Dot Product With LDW

[Figure: dependency graph divided into an A side and a B side]


Similarly, the dependency graph in Figure 6-6 for the floating-point dot prod- 
uct shows that the LDDW instructions are parents of the MPYSP instructions 
and the MPYSP instructions are parents of the ADDSP instructions. To split 
the graph between the A and B register files, place an equal number of 
LDDWs, MPYSPs, and ADDSPs on each side. To keep both sides even, place
the remaining two instructions, B and SUB, on opposite sides. 


Figure 6-6. Dependency Graph of Floating-Point Dot Product With LDDW

[Figure: dependency graph divided into an A side and a B side]

6.3.4 Linear Assembly Resource Allocation


After splitting the dependency graph for both the fixed-point and floating-point 
dot products, you can assign functional units and registers, as shown in the 
dependency graphs in Figure 6—7 and Figure 6-8 and in the instructions in 
Example 6-13 and Example 6-14. The .M1X and .M2X represent a path in the 
dependency graph crossing from one side to the other. 

Figure 6-7. Dependency Graph of Fixed-Point Dot Product With LDW (Showing
Functional Units)


Example 6-13. Linear Assembly for Fixed-Point Dot Product Inner Loop With LDW
(With Allocated Resources)


        LDW     .D1     *A4++,A2        ; load ai and ai+1 from memory
        LDW     .D2     *B4++,B2        ; load bi and bi+1 from memory
        MPY     .M1X    A2,B2,A6        ; ai * bi
        MPYH    .M2X    A2,B2,B6        ; ai+1 * bi+1
        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
        ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
        SUB     .S1     A1,1,A1         ; decrement loop counter
[A1]    B       .S2     LOOP            ; branch to loop


Figure 6-8. Dependency Graph of Floating-Point Dot Product With LDDW (Showing
Functional Units)


Example 6-14. Linear Assembly for Floating-Point Dot Product Inner Loop With LDDW
(With Allocated Resources)


        LDDW    .D1     *A4++,A3:A2     ; load ai and ai+1 from memory
        LDDW    .D2     *B4++,B3:B2     ; load bi and bi+1 from memory
        MPYSP   .M1X    A2,B2,A6        ; ai * bi
        MPYSP   .M2X    A3,B3,B6        ; ai+1 * bi+1
        ADDSP   .L1     A6,A7,A7        ; sum0 += (ai * bi)
        ADDSP   .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
        SUB     .S1     A1,1,A1         ; decrement loop counter
[A1]    B       .S2     LOOP            ; branch to loop


6.3.5 Final Assembly


Example 6—15 shows the final assembly code for the unrolled loop of the fixed- 
point dot product and Example 6—16 shows the final assembly code for the 
unrolled loop of the floating-point dot product. 


6.3.5.1 Fixed-Point Dot Product 


Example 6—15 uses LDW instructions instead of LDH instructions. 


Example 6-15. Assembly Code for Fixed-Point Dot Product With LDW
(Before Software Pipelining)


        MVK     .S1     50,A1           ; set up loop counter
||      ZERO    .L1     A7              ; zero out sum0 accumulator
||      ZERO    .L2     B7              ; zero out sum1 accumulator

LOOP:
        LDW     .D1     *A4++,A2        ; load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ; load bi & bi+1 from memory
        SUB     .S1     A1,1,A1         ; decrement loop counter
[A1]    B       .S1     LOOP            ; branch to loop
        NOP     2
        MPY     .M1X    A2,B2,A6        ; ai * bi
||      MPYH    .M2X    A2,B2,B6        ; ai+1 * bi+1
        NOP
        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
                                        ; Branch occurs here

        ADD     .L1X    A7,B7,A4        ; sum = sum0 + sum1


The code in Example 6-15 includes the following optimizations:


- The setup code for the loop is included to initialize the array pointers and
  the loop counter and to clear the accumulators. The setup code assumes
  that A4 and B4 have been initialized to point to arrays a and b, respectively.

- The MVK instruction initializes the loop counter.

- The two ZERO instructions, which execute in parallel, initialize the even
  and odd accumulators (sum0 and sum1) to 0.

- The third ADD instruction adds the even and odd accumulators.


6.3.5.2 Floating-Point Dot Product


Example 6-16 uses LDDW instructions instead of LDW instructions. 


Example 6-16. Assembly Code for Floating-Point Dot Product With LDDW
(Before Software Pipelining)


        MVK     .S1     50,A1           ; set up loop counter
||      ZERO    .L1     A7              ; zero out sum0 accumulator
||      ZERO    .L2     B7              ; zero out sum1 accumulator

LOOP:
        LDDW    .D1     *A4++,A3:A2     ; load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B3:B2     ; load bi & bi+1 from memory
        SUB     .S1     A1,1,A1         ; decrement loop counter
        NOP     2
[A1]    B       .S1     LOOP            ; branch to loop
        MPYSP   .M1X    A2,B2,A6        ; ai * bi
||      MPYSP   .M2X    A3,B3,B6        ; ai+1 * bi+1
        NOP     3
        ADDSP   .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADDSP   .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
                                        ; Branch occurs here

        NOP     3
        ADDSP   .L1X    A7,B7,A4        ; sum = sum0 + sum1
        NOP     3


The code in Example 6-16 includes the following optimizations: 


- The setup code for the loop is included to initialize the array pointers and
  the loop counter and to clear the accumulators. The setup code assumes
  that A4 and B4 have been initialized to point to arrays a and b, respectively.

- The MVK instruction initializes the loop counter.

- The two ZERO instructions, which execute in parallel, initialize the even
  and odd accumulators (sum0 and sum1) to 0.

- The third ADDSP instruction adds the even and odd accumulators.



6.3.6 Comparing Performance 


Executing the fixed-point dot product with the optimizations in Example 6-15
requires only 50 iterations, because you operate in parallel on both the even
and odd array elements. With the setup code and the final ADD instruction,
processing all 100 element pairs requires a total of 402 cycles (1 + 8 x 50 + 1).


Table 6-3 compares the performance of the different versions of the fixed- 
point dot product code discussed so far. 


Table 6-3. Comparison of Fixed-Point Dot Product Code With Use of LDW


Code Example                                                          100 Iterations      Cycle Count
Example 6-5    Fixed-point dot product nonparallel assembly           2 + 100 x 16        1602
Example 6-6    Fixed-point dot product parallel assembly              1 + 100 x 8         801
Example 6-15   Fixed-point dot product parallel assembly with LDW     1 + (50 x 8) + 1    402


Executing the floating-point dot product with the optimizations in
Example 6-16 requires only 50 iterations, because you operate in parallel on
both the even and odd array elements. With the setup code and the final
ADDSP instruction, processing all 100 element pairs requires a total of
508 cycles (1 + 10 x 50 + 7).


Table 6—4 compares the performance of the different versions of the floating- 
point dot product code discussed so far. 


Table 6-4. Comparison of Floating-Point Dot Product Code With Use of LDDW


Code Example                                                             100 Iterations       Cycle Count
Example 6-7    Floating-point dot product nonparallel assembly           2 + 100 x 21         2102
Example 6-8    Floating-point dot product parallel assembly              1 + 100 x 10         1001
Example 6-16   Floating-point dot product parallel assembly with LDDW    1 + (50 x 10) + 7    508
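
The cycle counts in Table 6-3 and Table 6-4 all follow the same pattern: setup cycles, plus iterations times cycles per iteration, plus any cycles after the loop. The following is a small sketch of that arithmetic; the function name total_cycles and its parameter names are hypothetical and serve only to make the formula explicit.

```c
#include <assert.h>

/* Total cycles = setup + iterations * cycles per iteration + trailing cycles.
 * "tail" covers instructions after the loop, such as the final ADD, or the
 * NOP 3 / ADDSP / NOP 3 sequence (7 cycles) in the floating-point version. */
static int total_cycles(int setup, int iterations, int cycles_per_iter, int tail)
{
    assert(setup >= 0 && iterations >= 0 && cycles_per_iter >= 0 && tail >= 0);
    return setup + iterations * cycles_per_iter + tail;
}
```

For example, total_cycles(1, 50, 8, 1) corresponds to the Example 6-15 row and total_cycles(1, 50, 10, 7) to the Example 6-16 row.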

6.4 Software Pipelining


This section describes the process for improving the performance of the as- 
sembly code in the previous section through software pipelining. 


Software pipelining is a technique used to schedule instructions from a loop 
so that multiple iterations execute in parallel. The parallel resources on the 
’C6x make it possible to initiate a new loop iteration before previous iterations 
finish. The goal of software pipelining is to start a new loop iteration as soon 
as possible. 


The modulo iteration interval scheduling table is introduced in this section as 
an aid to creating software-pipelined loops. 


The fixed-point dot product code in Example 6—15 needs eight cycles for each 
iteration of the loop: five cycles for the LDWs, two cycles for the MPYs, and one 
cycle for the ADDs. 


Figure 6-9 shows the dependency graph for the fixed-point dot product 
instructions. Example 6-17 shows the same dot product assembly code in 
Example 6-13 on page 6-20, except that the SUB instruction is now condition- 
al on the loop counter (A1). 


Note:


Making the SUB instruction conditional on A1 ensures that A1 stops
decrementing when it reaches 0. Otherwise, as the loop executes five more
times, the loop counter becomes a negative number. When A1 is negative, it is
nonzero and, therefore, causes the condition on the branch to be true again.
If the SUB instruction were not conditional on A1, you would have an infinite
loop.

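
The effect of the conditional SUB can be checked with a simplified model. The sketch below tracks only the loop counter and the branch predicate over a number of cycles of a single-cycle loop; it deliberately ignores branch delay slots and everything else in the pipeline. The function name and parameters are hypothetical.

```c
#include <assert.h>

/* Counts the cycles on which the [A1] predicate of the branch is true.
 * Simplified model (assumption): SUB and B both read A1 at the start of
 * the cycle; branch delay slots are not modeled. */
static int branches_with_true_condition(int count, int sub_is_conditional,
                                        int max_cycles)
{
    int a1 = count;
    int taken = 0;
    int cycle;

    assert(count >= 0 && max_cycles >= 0);
    for (cycle = 0; cycle < max_cycles; cycle++) {
        int nonzero = (a1 != 0);    /* predicate [A1] is true when A1 != 0 */

        if (nonzero)
            taken++;                /* [A1] B LOOP */
        if (!sub_is_conditional || nonzero)
            a1--;                   /* SUB A1,1,A1, predicated or not */
    }
    return taken;
}
```

With a conditional SUB and a counter of 50, the predicate is true on exactly 50 cycles however long the model runs. With an unconditional SUB, the counter passes through 0 once and then stays negative, so the predicate is true on every later cycle.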
The floating-point dot product code in Example 6—16 needs ten cycles for each 
iteration of the loop: five cycles for the LDDWs, four cycles for the MPYSPs, 
and one cycle for the ADDSPs. 


Figure 6-10 shows the dependency graph for the floating-point dot product 
instructions. Example 6-18 shows the same dot product assembly code in 
Example 6-14 on page 6-21, except that the SUB instruction is now condition- 
al on the loop counter (A1). 


Note:


The ADDSP has 3 delay slots associated with it. The extra delay slots are
taken up by the LDDW, SUB, and NOP when executing the next iteration of the
loop. Thus an NOP 3 is not required inside the loop but is required outside
the loop prior to adding sum0 and sum1 together.


Figure 6-9. Dependency Graph of Fixed-Point Dot Product With LDW
(Showing Functional Units)

[Figure: dependency graph divided into an A side and a B side, annotated with functional units]


Example 6-17. Linear Assembly for Fixed-Point Dot Product Inner Loop
(With Conditional SUB Instruction)


        LDW     .D1     *A4++,A2        ; load ai and ai+1 from memory
        LDW     .D2     *B4++,B2        ; load bi and bi+1 from memory
        MPY     .M1X    A2,B2,A6        ; ai * bi
        MPYH    .M2X    A2,B2,B6        ; ai+1 * bi+1
        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
        ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
[A1]    SUB     .S1     A1,1,A1         ; decrement loop counter
[A1]    B       .S2     LOOP            ; branch to top of loop


Figure 6-10. Dependency Graph of Floating-Point Dot Product With LDDW
(Showing Functional Units)

[Figure: dependency graph divided into an A side and a B side, annotated with functional units]


Example 6-18. Linear Assembly for Floating-Point Dot Product Inner Loop
(With Conditional SUB Instruction)


        LDDW    .D1     *A4++,A3:A2     ; load ai and ai+1 from memory
        LDDW    .D2     *B4++,B3:B2     ; load bi and bi+1 from memory
        MPYSP   .M1X    A2,B2,A6        ; ai * bi
        MPYSP   .M2X    A3,B3,B6        ; ai+1 * bi+1
        ADDSP   .L1     A6,A7,A7        ; sum0 += (ai * bi)
        ADDSP   .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
[A1]    SUB     .S1     A1,1,A1         ; decrement loop counter
[A1]    B       .S2     LOOP            ; branch to top of loop


6.4.1 Modulo Iteration Interval Scheduling


Another way to represent the performance of the code is by looking at it in a
modulo iteration interval scheduling table. This table shows how a 
software-pipelined loop executes and tracks the available resources on a 
cycle-by-cycle basis to ensure that no resource is used twice on any given 
cycle. The iteration interval of a loop is the number of cycles between the initia- 
tions of successive iterations of that loop. 


6.4.1.1 Fixed-Point Example 


The fixed-point code in Example 6-15 needs eight cycles for each iteration of 
the loop, so the iteration interval is eight. 


Table 6—5 shows a modulo iteration interval scheduling table for the fixed-point 
dot product loop before software pipelining (Example 6-15). Each row repre- 
sents a functional unit. There is a column for each cycle in the loop showing 
the instruction that is executing on a particular cycle: 


- LDWs on the .D units are issued on cycles 0, 8, 16, 24, etc.
- MPY and MPYH on the .M units are issued on cycles 5, 13, 21, 29, etc.
- ADDs on the .L units are issued on cycles 7, 15, 23, 31, etc.
- SUB on the .S1 unit is issued on cycles 1, 9, 17, 25, etc.
- B on the .S2 unit is issued on cycles 2, 10, 18, 26, etc.
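
Each of these sequences is the first-iteration issue cycle plus a multiple of the iteration interval, and the table column is the cycle number modulo the iteration interval. The following sketch makes that relationship explicit; the function names issue_cycle and table_column are hypothetical.

```c
#include <assert.h>

/* Cycle on which an instruction issues in iteration k (k = 0, 1, 2, ...),
 * given its issue cycle in the first iteration and the iteration interval. */
static int issue_cycle(int first_issue, int iteration_interval, int k)
{
    assert(first_issue >= 0 && iteration_interval > 0 && k >= 0);
    return first_issue + k * iteration_interval;
}

/* Column of the modulo scheduling table in which a given cycle falls. */
static int table_column(int cycle, int iteration_interval)
{
    assert(cycle >= 0 && iteration_interval > 0);
    return cycle % iteration_interval;
}
```

With an iteration interval of 8, an instruction first issued on cycle 5 always lands in column 5 of the table, whichever iteration it belongs to.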


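Table 6-5. Modulo Iteration Interval Scheduling Table for Fixed-Point Dot Product
(Before Software Pipelining)


Unit / Cycle   0, 8, ...   1, 9, ...   2, 10, ...  3, 11, ...  4, 12, ...  5, 13, ...  6, 14, ...  7, 15, ...
.D1            LDW         -           -           -           -           -           -           -
.D2            LDW         -           -           -           -           -           -           -
.M1            -           -           -           -           -           MPY         -           -
.M2            -           -           -           -           -           MPYH        -           -
.L1            -           -           -           -           -           -           -           ADD
.L2            -           -           -           -           -           -           -           ADD
.S1            -           SUB         -           -           -           -           -           -
.S2            -           -           B           -           -           -           -           -
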

6.4.1.2 Floating-Point Example


The floating-point code in Example 6—16 needs ten cycles for each iteration 
of the loop, so the iteration interval is ten. 


Table 6-6 shows a modulo iteration interval scheduling table for the floating- 
point dot product loop before software pipelining (Example 6-16). Each row 
represents a functional unit. There is a column for each cycle in the loop show- 
ing the instruction that is executing on a particular cycle: 


- LDDWs on the .D units are issued on cycles 0, 10, 20, 30, etc.
- MPYSPs on the .M units are issued on cycles 5, 15, 25, 35, etc.
- ADDSPs on the .L units are issued on cycles 9, 19, 29, 39, etc.
- SUB on the .S1 unit is issued on cycles 3, 13, 23, 33, etc.
- B on the .S2 unit is issued on cycles 4, 14, 24, 34, etc.


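Table 6-6. Modulo Iteration Interval Scheduling Table for Floating-Point Dot Product
(Before Software Pipelining)


Unit / Cycle   0, 10, ...  1, 11, ...  2, 12, ...  3, 13, ...  4, 14, ...  5, 15, ...  6, 16, ...  7, 17, ...  8, 18, ...  9, 19, ...
.D1            LDDW        -           -           -           -           -           -           -           -           -
.D2            LDDW        -           -           -           -           -           -           -           -           -
.M1            -           -           -           -           -           MPYSP       -           -           -           -
.M2            -           -           -           -           -           MPYSP       -           -           -           -
.L1            -           -           -           -           -           -           -           -           -           ADDSP
.L2            -           -           -           -           -           -           -           -           -           ADDSP
.S1            -           -           -           SUB         -           -           -           -           -           -
.S2            -           -           -           -           B           -           -           -           -           -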


In this example, each unit is used only once every ten cycles. 

6.4.1.3 Determining the Minimum Iteration Interval


Software pipelining increases performance by using the resources more effi- 
ciently. However, to create a fully pipelined schedule, it is helpful to first deter- 
mine the minimum iteration interval. 


The minimum iteration interval of a loop is the minimum number of cycles you 
must wait between each initiation of successive iterations of that loop. The 
smaller the iteration interval, the fewer cycles it takes to execute a loop. 


Resources and data dependency constraints determine the minimum iteration 
interval. The most-used resource constrains the minimum iteration interval. 
For example, if four instructions in a loop all use the .S1 unit, the minimum it- 
eration interval is at least 4. Four instructions using the same resource cannot 
execute in parallel and, therefore, require at least four separate cycles to 
execute each instruction. 


With the SUB and branch instructions on opposite sides of the dependency 
graph in Figure 6-9 and Figure 6—10, all eight instructions use a different func- 
tional unit and no two instructions use the same cross paths (1X and 2X). 
Because no two instructions use the same resource, the minimum iteration in- 
terval based on resources is 1. 
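
The resource constraint can be expressed as a count: the resource-bound minimum iteration interval is the largest number of loop instructions that need the same functional unit. The following is a minimal sketch under the assumption that only the eight functional units are counted (cross paths and data dependencies are left out); the enum and function names are hypothetical.

```c
#include <assert.h>

/* The eight functional units; NUM_UNITS is the count, not a unit. */
enum unit { D1, D2, M1, M2, L1, L2, S1, S2, NUM_UNITS };

/* Resource-bound minimum iteration interval: the largest number of loop
 * instructions that compete for any single functional unit. */
static int resource_min_ii(const enum unit *units, int n)
{
    int use[NUM_UNITS] = {0};
    int i, max = 0;

    assert(n >= 0);
    for (i = 0; i < n; i++) {
        assert((int)units[i] >= 0 && (int)units[i] < NUM_UNITS);
        if (++use[units[i]] > max)
            max = use[units[i]];
    }
    return max;
}
```

Passing the eight units used by the dot product loop gives 1, and passing four instructions that all use S1 gives 4, matching the two cases described above.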


Note:


In this particular example, there are no data dependencies to affect the
minimum iteration interval. However, future examples may demonstrate this
constraint.


6.4.1.4 Creating a Fully Pipelined Schedule 


Having determined that the minimum iteration interval is 1, you can initiate a 
new iteration every cycle. You can schedule LDW (or LDDW) and MPY (or 
MPYSP) instructions on every cycle. 


Fixed-Point Example 

Table 6-7 shows a fully pipelined schedule for the fixed-point dot product
example.


Table 6-7. Modulo Iteration Interval Table for Fixed-Point Dot Product
(After Software Pipelining)

[Table: functional units by cycle, with the loop prolog columns followed by the single-cycle loop column]

Note: The asterisks indicate the iteration of the loop; shading indicates the single-cycle loop.


The rightmost column in Table 6—7 is a single-cycle loop that contains the 
entire loop. Cycles 0-6 are loop setup code, or loop prolog. 


Asterisks define which iteration of the loop the instruction is executing each 
cycle. For example, the rightmost column shows that on any given cycle inside 
the loop: 


- The ADD instructions are adding data for iteration n.
- The MPY instructions are multiplying data for iteration n + 2 (**).
- The LDW instructions are loading data for iteration n + 7 (*******).
- The SUB instruction is executing for iteration n + 6 (******).
- The B instruction is executing for iteration n + 5 (*****).


In this case, multiple iterations of the loop execute in parallel in a software pipe- 
line that is eight iterations deep, with iterations n through n + 7 executing in par- 
allel. Fixed-point software pipelines are rarely deeper than the one created by 
this single-cycle loop. As loop sizes grow, the number of iterations that can 
execute in parallel tends to become fewer. 
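
With an iteration interval of 1, the number of asterisks for an instruction is the first cycle of the single-cycle loop minus the cycle on which that instruction first issues. The following sketch shows this calculation, using the first-issue cycles from the schedules before pipelining (for the fixed-point case: LDW 0, SUB 1, B 2, MPY 5, ADD 7). The function names are hypothetical.

```c
#include <assert.h>

/* For a loop pipelined with an iteration interval of 1: how many iterations
 * ahead of the oldest in-flight iteration an instruction is working, given
 * the cycle of its first issue and the first cycle of the single-cycle loop. */
static int iterations_ahead(int first_loop_cycle, int first_issue_cycle)
{
    assert(first_issue_cycle >= 0 && first_issue_cycle <= first_loop_cycle);
    return first_loop_cycle - first_issue_cycle;
}

/* Pipeline depth = number of iterations in flight = largest offset + 1. */
static int pipeline_depth(int first_loop_cycle, const int *first_issue, int n)
{
    int i, max = 0;

    assert(n >= 0);
    for (i = 0; i < n; i++) {
        int ahead = iterations_ahead(first_loop_cycle, first_issue[i]);

        if (ahead > max)
            max = ahead;
    }
    return max + 1;
}
```

For the fixed-point loop, whose single-cycle loop starts on cycle 7, this gives offsets of 7, 6, 5, 2, and 0 and a depth of eight iterations.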

Floating-Point Example


Table 6-8 shows a fully pipelined schedule for the floating-point dot product 
example. 


Table 6-8. Modulo Iteration Interval Table for Floating-Point Dot Product
(After Software Pipelining)

[Table: functional units by cycle, with the loop prolog columns followed by the single-cycle loop column]

Note: The asterisks indicate the iteration of the loop; shading indicates the single-cycle loop.


The rightmost column in Table 6-8 is a single-cycle loop that contains the 
entire loop. Cycles 0-8 are loop setup code, or loop prolog. 


Asterisks define which iteration of the loop the instruction is executing each 
cycle. For example, the rightmost column shows that on any given cycle inside 
the loop: 


- The ADDSP instructions are adding data for iteration n.
- The MPYSP instructions are multiplying data for iteration n + 4 (****).
- The LDDW instructions are loading data for iteration n + 9 (*********).
- The SUB instruction is executing for iteration n + 6 (******).
- The B instruction is executing for iteration n + 5 (*****).


Note:


Since the ADDSP instruction has three delay slots associated with it, the
results of adding are staggered by four. That is, the first result from the
ADDSP is added to the fifth result, which is then added to the ninth, and so
on. The second result is added to the sixth, which is then added to the 10th.
This is shown in Table 6-9.


In this case, multiple iterations of the loop execute in parallel in a software pipe- 
line that is ten iterations deep, with iterations n through n + 9 executing in paral- 
lel. Floating-point software pipelines are rarely deeper than the one created 
by this single-cycle loop. As loop sizes grow, the number of iterations that can 
execute in parallel tends to become fewer. 


6.4.1.5 Staggered Accumulation With a Multicycle Instruction 


When accumulating results with an instruction that is multicycle (that is, has 
delay slots other than 0), you must either unroll the loop or stagger the results. 
When unrolling the loop, multiple accumulators collect the results so that one
result has finished executing and has been written into the accumulator before
the next result is added to that accumulator. If you do not unroll the loop,
then the accumulator will contain staggered results.


Staggered results occur when you attempt to accumulate successive results 
while in the delay slots of previous execution. This can be achieved without 
error if you are aware of what is in the accumulator, what will be added to that 
accumulator, and when the results will be written on a given cycle (such as the 
pseudo-code shown in Example 6-19). 


Example 6-19. Pseudo-Code for Single-Cycle Accumulator With ADDSP


LOOP:     ADDSP   x,sum,sum
||        LDW     *xptr++,x
|| [cond] B       LOOP
|| [cond] SUB     cond,1,cond


Table 6-9 shows the results of the loop kernel for a single-cycle accumulator 
using a multicycle add instruction; in this case, the ADDSP, which has three 
delay slots (a 4-cycle instruction). 

Table 6-9. Software Pipeline Accumulation Staggered Results Due to Three-Cycle
Delay


                                        Current value of
Cycle #   Pseudoinstruction             pseudoregister sum      Written expected result
0         ADDSP x(0), sum, sum          0                       ; cycle 4   sum = x(0)
1         ADDSP x(1), sum, sum          0                       ; cycle 5   sum = x(1)
2         ADDSP x(2), sum, sum          0                       ; cycle 6   sum = x(2)
3         ADDSP x(3), sum, sum          0                       ; cycle 7   sum = x(3)
4         ADDSP x(4), sum, sum          x(0)                    ; cycle 8   sum = x(0) + x(4)
5         ADDSP x(5), sum, sum          x(1)                    ; cycle 9   sum = x(1) + x(5)
6         ADDSP x(6), sum, sum          x(2)                    ; cycle 10  sum = x(2) + x(6)
7         ADDSP x(7), sum, sum          x(3)                    ; cycle 11  sum = x(3) + x(7)
8         ADDSP x(8), sum, sum          x(0) + x(4)             ; cycle 12  sum = x(0) + x(4) + x(8)

i+j †     ADDSP x(i+j), sum, sum        x(j) + x(j+4) + x(j+8) ... x(i-4+j)
                                        ; cycle i+j+4   sum = x(j) + x(j+4) + x(j+8) ... x(i-4+j) + x(i+j)


† where i is a multiple of 4



The first value of the array x, x(0) is added to the accumulator (sum) on cycle 
0, but the result is not ready until cycle 4. This means that on cycle 1 when x(1) 
is added to the accumulator (sum), sum has no value in it from x(0). Thus, 
when this result is ready on cycle 5, sum will have the value x(1) in it, instead 
of the value x(0) + x(1). When you reach cycle 4, sum will have the value x(0) 
in it and the value x(4) will be added to that, causing sum = x(0) + x(4) on 
cycle 8. This is continuously repeated, resulting in four separate
accumulations (using the register “sum”).


The current value in the accumulator “sum” depends on which iteration is be- 
ing done. After the completion of the loop, the last four sums should be written 
into separate registers and then added together to give the final result. This 
is shown in Example 6-23 on page 6-39. 
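
The staggered behavior can be reproduced with a short simulation. The following is a minimal sketch, assuming one ADDSP is issued per cycle into a single register and that each result becomes visible four cycles after issue. The names ADDSP_LATENCY, staggered_accumulate, and combine are hypothetical.

```c
#include <assert.h>

#define ADDSP_LATENCY 4   /* three delay slots: result lands 4 cycles after issue */

/* Simulate "ADDSP x(cycle),sum,sum" issued once per cycle for n cycles.
 * partial[k] receives the running sum of x(k), x(k+4), x(k+8), ... */
static void staggered_accumulate(const float *x, int n,
                                 float partial[ADDSP_LATENCY])
{
    float sum = 0.0f;                          /* value visible in the register */
    float in_flight[ADDSP_LATENCY] = {0.0f};   /* results not yet written back  */
    int   pending[ADDSP_LATENCY] = {0};
    int   cycle, k;

    assert(n >= 0);
    for (k = 0; k < ADDSP_LATENCY; k++)
        partial[k] = 0.0f;

    /* Run n issue cycles plus ADDSP_LATENCY cycles to drain the adder. */
    for (cycle = 0; cycle < n + ADDSP_LATENCY; cycle++) {
        int slot = cycle % ADDSP_LATENCY;

        if (pending[slot]) {            /* ADDSP issued 4 cycles ago lands */
            sum = in_flight[slot];
            partial[slot] = sum;
            pending[slot] = 0;
        }
        if (cycle < n) {                /* issue ADDSP x(cycle),sum,sum */
            in_flight[slot] = sum + x[cycle];
            pending[slot] = 1;
        }
    }
}

/* Add the four separate accumulations pairwise to form the final result. */
static float combine(const float partial[ADDSP_LATENCY])
{
    return (partial[0] + partial[1]) + (partial[2] + partial[3]);
}
```

Because floating-point addition is not associative, the combined result can differ in the last bits from a strictly sequential sum of the same data.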

6.4.2 Using the Assembly Optimizer to Create Optimized Loops


Example 6—20 shows the linear assembly code for the full fixed-point dot prod- 
uct loop. Example 6-21 shows the linear assembly code for the full floating- 
point dot product loop. You can use this code as input to the assembly optimiz- 
er tool to create software-pipelined loops automatically. See the TMS320C6x 
Optimizing C Compiler User’s Guide for more information on the assembly op- 
timizer. 


Example 6-20. Linear Assembly for Full Fixed-Point Dot Product


        .global _dotp
_dotp:  .cproc  a, b
        .reg    sum, sum0, sum1, cntr
        .reg    ai_i1, bi_i1, pi, pi1

        MVK     50,cntr             ; cntr = 100/2
        ZERO    sum0                ; multiply result = 0
        ZERO    sum1                ; multiply result = 0

LOOP:   .trip   50
        LDW     *a++,ai_i1          ; load ai & ai+1 from memory
        LDW     *b++,bi_i1          ; load bi & bi+1 from memory
        MPY     ai_i1,bi_i1,pi      ; ai * bi
        MPYH    ai_i1,bi_i1,pi1     ; ai+1 * bi+1
        ADD     pi,sum0,sum0        ; sum0 += (ai * bi)
        ADD     pi1,sum1,sum1       ; sum1 += (ai+1 * bi+1)
[cntr]  SUB     cntr,1,cntr         ; decrement loop counter
[cntr]  B       LOOP                ; branch to loop

        ADD     sum0,sum1,sum       ; compute final result
        .return sum
        .endproc


Resources such as functional units and 1X and 2X cross paths do not have 
to be specified because these can be allocated automatically by the assembly 
optimizer. 

Example 6-21. Linear Assembly for Full Floating-Point Dot Product


        .global _dotp
_dotp:  .cproc  a, b
        .reg    sum, sum0, sum1, cntr
        .reg    ai, ai1, bi, bi1, pi, pi1

        MVK     50,cntr             ; cntr = 100/2
        ZERO    sum0                ; multiply result = 0
        ZERO    sum1                ; multiply result = 0

LOOP:   .trip   50
        LDDW    *a++,ai1:ai         ; load ai & ai+1 from memory
        LDDW    *b++,bi1:bi         ; load bi & bi+1 from memory
        MPYSP   ai,bi,pi            ; ai * bi
        MPYSP   ai1,bi1,pi1         ; ai+1 * bi+1
        ADDSP   pi,sum0,sum0        ; sum0 += (ai * bi)
        ADDSP   pi1,sum1,sum1       ; sum1 += (ai+1 * bi+1)
[cntr]  SUB     cntr,1,cntr         ; decrement loop counter
[cntr]  B       LOOP                ; branch to loop

        ADDSP   sum0,sum1,sum       ; compute final result
        .return sum
        .endproc

6.4.3 Final Assembly 

Example 6-22 shows the assembly code for the fixed-point software-pipelined
dot product in Table 6-7 on page 6-31. Example 6-23 shows the assembly
code for the floating-point software-pipelined dot product in Table 6-8 on
page 6-32. The accumulators are initialized to 0 and the loop counter is set up
in the first execute packet in parallel with the first load instructions. The
asterisks in the comments correspond with those in Table 6-7 and Table 6-8,
respectively.


Note:


All instructions executing in parallel constitute an execute packet. An
execute packet can contain up to eight instructions.

See the TMS320C62x/C67x CPU and Instruction Set Reference Guide for
more information about pipeline operation.


6.4.3.1 Fixed-Point Example


Multiple branch instructions are in the pipe. The first branch in the fixed-point 
dot product is issued on cycle 2 but does not actually branch until the end of 
cycle 7 (after five delay slots). The branch target is the execute packet defined 
by the label LOOP. On cycle 7, the first branch returns to the same execute 
packet, resulting in a single-cycle loop. On every cycle after cycle 7, a branch 
executes back to LOOP until the loop counter finally decrements to 0. Once 
the loop counter is 0, five more branches execute because they are already 
in the pipe. 


Executing the dot product code with the software pipelining as shown in 
Example 6—22 requires a total of 58 cycles (7 + 50 + 1), which is a significant 
improvement over the 402 cycles required by the code in Example 6-15. 


Note:


The code created by the assembly optimizer will not completely match the
final assembly code shown in this and future sections because different
versions of the tool will produce slightly different code. However, the inner
loop performance (number of cycles per iteration) should be similar.


Example 6-22. Assembly Code for Fixed-Point Dot Product (Software Pipelined)


          LDW     .D1     *A4++,A2        ; load ai & ai+1 from memory
||        LDW     .D2     *B4++,B2        ; load bi & bi+1 from memory
||        MVK     .S1     50,A1           ; set up loop counter
||        ZERO    .L1     A7              ; zero out sum0 accumulator
||        ZERO    .L2     B7              ; zero out sum1 accumulator

   [A1]   SUB     .S1     A1,1,A1         ; decrement loop counter
||        LDW     .D1     *A4++,A2        ;* load ai & ai+1 from memory
||        LDW     .D2     *B4++,B2        ;* load bi & bi+1 from memory

   [A1]   SUB     .S1     A1,1,A1         ;* decrement loop counter
|| [A1]   B       .S2     LOOP            ; branch to loop
||        LDW     .D1     *A4++,A2        ;** load ai & ai+1 from memory
||        LDW     .D2     *B4++,B2        ;** load bi & bi+1 from memory

   [A1]   SUB     .S1     A1,1,A1         ;** decrement loop counter
|| [A1]   B       .S2     LOOP            ;* branch to loop
||        LDW     .D1     *A4++,A2        ;*** load ai & ai+1 from memory
||        LDW     .D2     *B4++,B2        ;*** load bi & bi+1 from memory

   [A1]   SUB     .S1     A1,1,A1         ;*** decrement loop counter
|| [A1]   B       .S2     LOOP            ;** branch to loop
||        LDW     .D1     *A4++,A2        ;**** load ai & ai+1 from memory
||        LDW     .D2     *B4++,B2        ;**** load bi & bi+1 from memory

          MPY     .M1X    A2,B2,A6        ; ai * bi
||        MPYH    .M2X    A2,B2,B6        ; ai+1 * bi+1
|| [A1]   SUB     .S1     A1,1,A1         ;**** decrement loop counter
|| [A1]   B       .S2     LOOP            ;*** branch to loop
||        LDW     .D1     *A4++,A2        ;***** ld ai & ai+1 from memory
||        LDW     .D2     *B4++,B2        ;***** ld bi & bi+1 from memory

          MPY     .M1X    A2,B2,A6        ;* ai * bi
||        MPYH    .M2X    A2,B2,B6        ;* ai+1 * bi+1
|| [A1]   SUB     .S1     A1,1,A1         ;***** decrement loop counter
|| [A1]   B       .S2     LOOP            ;**** branch to loop
||        LDW     .D1     *A4++,A2        ;****** ld ai & ai+1 from memory
||        LDW     .D2     *B4++,B2        ;****** ld bi & bi+1 from memory

LOOP:
          ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
||        ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||        MPY     .M1X    A2,B2,A6        ;** ai * bi
||        MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1
|| [A1]   SUB     .S1     A1,1,A1         ;****** decrement loop counter
|| [A1]   B       .S2     LOOP            ;***** branch to loop
||        LDW     .D1     *A4++,A2        ;******* ld ai & ai+1 from memory
||        LDW     .D2     *B4++,B2        ;******* ld bi & bi+1 from memory
          ; Branch occurs here

          ADD     .L1X    A7,B7,A4        ; sum = sum0 + sum1


6.4.3.2 Floating-Point Example


The first branch in the floating-point dot product is issued on cycle 4 but does 
not actually branch until the end of cycle 9 (after five delay slots). The branch 
target is the execute packet defined by the label LOOP. On cycle 9, the first 
branch returns to the same execute packet, resulting in a single-cycle loop. On 
every cycle after cycle 9, a branch executes back to LOOP until the loop count- 
er finally decrements to 0. Once the loop counter is 0, five more branches 
execute because they are already in the pipe. 


Executing the floating-point dot product code with the software pipelining as 
shown in Example 6-23 requires a total of 74 cycles (9 + 50 + 15), which is a
significant improvement over the 508 cycles required by the code in 
Example 6-16. 


Example 6-23. Assembly Code for Floating-Point Dot Product (Software Pipelined)


        MVK    .S1   50,A1           ; set up loop counter
||      ZERO   .L1   A8              ; sum0 = 0
||      ZERO   .L2   B8              ; sum1 = 0
||      LDDW   .D1   A4++,A7:A6      ; load ai & ai + 1 from memory
||      LDDW   .D2   B4++,B7:B6      ; load bi & bi + 1 from memory

        LDDW   .D1   A4++,A7:A6      ;* load ai & ai + 1 from memory
||      LDDW   .D2   B4++,B7:B6      ;* load bi & bi + 1 from memory

        LDDW   .D1   A4++,A7:A6      ;** load ai & ai + 1 from memory
||      LDDW   .D2   B4++,B7:B6      ;** load bi & bi + 1 from memory

        LDDW   .D1   A4++,A7:A6      ;*** load ai & ai + 1 from memory
||      LDDW   .D2   B4++,B7:B6      ;*** load bi & bi + 1 from memory
|| [A1] SUB    .S1   A1,1,A1         ; decrement loop counter

        LDDW   .D1   A4++,A7:A6      ;**** load ai & ai + 1 from memory
||      LDDW   .D2   B4++,B7:B6      ;**** load bi & bi + 1 from memory
|| [A1] B      .S2   LOOP            ; branch to loop
|| [A1] SUB    .S1   A1,1,A1         ;* decrement loop counter

        LDDW   .D1   A4++,A7:A6      ;***** load ai & ai + 1 from memory
||      LDDW   .D2   B4++,B7:B6      ;***** load bi & bi + 1 from memory
||      MPYSP  .M1X  A6,B6,A5        ; pi = a0 * b0
||      MPYSP  .M2X  A7,B7,B5        ; pi1 = a1 * b1
|| [A1] B      .S2   LOOP            ;* branch to loop
|| [A1] SUB    .S1   A1,1,A1         ;** decrement loop counter

        LDDW   .D1   A4++,A7:A6      ;****** load ai & ai + 1 from memory
||      LDDW   .D2   B4++,B7:B6      ;****** load bi & bi + 1 from memory
||      MPYSP  .M1X  A6,B6,A5        ;* pi = a0 * b0
||      MPYSP  .M2X  A7,B7,B5        ;* pi1 = a1 * b1
|| [A1] B      .S2   LOOP            ;** branch to loop
|| [A1] SUB    .S1   A1,1,A1         ;*** decrement loop counter

        LDDW   .D1   A4++,A7:A6      ;******* load ai & ai + 1 from memory
||      LDDW   .D2   B4++,B7:B6      ;******* load bi & bi + 1 from memory
||      MPYSP  .M1X  A6,B6,A5        ;** pi = a0 * b0
||      MPYSP  .M2X  A7,B7,B5        ;** pi1 = a1 * b1
|| [A1] B      .S2   LOOP            ;*** branch to loop
|| [A1] SUB    .S1   A1,1,A1         ;**** decrement loop counter

        LDDW   .D1   A4++,A7:A6      ;******** load ai & ai + 1 from memory
||      LDDW   .D2   B4++,B7:B6      ;******** load bi & bi + 1 from memory
||      MPYSP  .M1X  A6,B6,A5        ;*** pi = a0 * b0
||      MPYSP  .M2X  A7,B7,B5        ;*** pi1 = a1 * b1
|| [A1] B      .S2   LOOP            ;**** branch to loop
|| [A1] SUB    .S1   A1,1,A1         ;***** decrement loop counter

LOOP:
        LDDW   .D1   A4++,A7:A6      ;********* load ai & ai + 1 from memory
||      LDDW   .D2   B4++,B7:B6      ;********* load bi & bi + 1 from memory
||      MPYSP  .M1X  A6,B6,A5        ;**** pi = a0 * b0
||      MPYSP  .M2X  A7,B7,B5        ;**** pi1 = a1 * b1
||      ADDSP  .L1   A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP  .L2   B5,B8,B8        ; sum1 += (ai+1 * bi+1)
|| [A1] B      .S2   LOOP            ;***** branch to loop
|| [A1] SUB    .S1   A1,1,A1         ;****** decrement loop counter
        ; Branch occurs here

        ADDSP  .L1X  A8,B8,A0        ; sum(0) = sum0(0) + sum1(0)
        ADDSP  .L2X  A8,B8,B0        ; sum(1) = sum0(1) + sum1(1)
        ADDSP  .L1X  A8,B8,A0        ; sum(2) = sum0(2) + sum1(2)
        ADDSP  .L2X  A8,B8,B0        ; sum(3) = sum0(3) + sum1(3)
        NOP                          ; wait for B0
        ADDSP  .L1X  A0,B0,A5        ; sum(01) = sum(0) + sum(1)
        NOP                          ; wait for next B0
        ADDSP  .L2X  A0,B0,B5        ; sum(23) = sum(2) + sum(3)
        NOP    3
        ADDSP  .L1X  A5,B5,A4        ; sum = sum(01) + sum(23)
        NOP    3
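

The four ADDSP instructions that follow the loop in Example 6-23 exist because an ADDSP result is not available until four cycles after issue, so each accumulator register effectively carries four independent running sums. The C sketch below models only that accumulation order; the function name dot_product_model and the lane indexing are illustrative assumptions, not part of the original code.

```c
#include <assert.h>
#include <stddef.h>

#define ADDSP_LATENCY 4   /* assumed: ADDSP result usable four cycles after issue */

/* Hypothetical model of the order in which the pipelined loop adds products. */
float dot_product_model(const float *a, const float *b, size_t n)
{
    float sum0[ADDSP_LATENCY] = {0.0f};   /* running sums interleaved in A8 */
    float sum1[ADDSP_LATENCY] = {0.0f};   /* running sums interleaved in B8 */
    float sum[ADDSP_LATENCY];
    size_t i;

    assert(n % 2 == 0);                   /* the loop consumes two elements per cycle */

    for (i = 0; i < n; i += 2) {
        size_t lane = (i / 2) % ADDSP_LATENCY;   /* in-flight sum fed on this cycle */
        sum0[lane] += a[i]     * b[i];
        sum1[lane] += a[i + 1] * b[i + 1];
    }

    /* Mirrors the code after the loop: sum(k) = sum0(k) + sum1(k), then pairwise adds. */
    for (i = 0; i < ADDSP_LATENCY; i++)
        sum[i] = sum0[i] + sum1[i];
    return (sum[0] + sum[1]) + (sum[2] + sum[3]);
}
```

Because floating-point addition is not associative, this ordering can round differently from a strictly sequential sum, which is one reason to model it explicitly when comparing results.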
6.4.3.3 Removing Extraneous Instructions


The code in Example 6-22 and Example 6-23 executes extra iterations of
some of the instructions in the loop. The following operations occur in parallel
on the last cycle of the loop in Example 6-22:


-  Iteration 50 of the ADD instructions
-  Iteration 52 of the MPY and MPYH instructions
-  Iteration 57 of the LDW instructions


The following operations occur in parallel on the last cycle of the loop in
Example 6-23:


-  Iteration 50 of the ADDSP instructions
-  Iteration 54 of the MPYSP instructions
-  Iteration 59 of the LDDW instructions


In most cases, extra iterations are not a problem; however, when extraneous
LDWs and LDDWs access unmapped memory, you can get unpredictable
results. If the extraneous instructions present a potential problem, remove the
extraneous load and multiply instructions by adding an epilog like that included
in the second part of Example 6-24 and Example 6-25.
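

The iteration numbers listed above follow directly from the pipeline depths: the multiply runs ahead of the add by the multiply latency, and the load runs ahead of the multiply by the load latency. A minimal C sketch of that arithmetic follows; the struct and function names are hypothetical, and the latencies (five cycles for LDW/LDDW, two for MPY, four for MPYSP) are the values used elsewhere in this section.

```c
#include <assert.h>

/* Hypothetical description of the load -> multiply -> add pipeline. */
struct pipe_depth {
    int load_cycles;   /* cycles from load issue until the multiply can use the data */
    int mul_cycles;    /* cycles from multiply issue until the add can use the product */
};

/* Iteration of the multiply that issues in parallel with add iteration add_iter. */
int mul_iteration(struct pipe_depth d, int add_iter)
{
    assert(add_iter >= 1);
    return add_iter + d.mul_cycles;
}

/* Iteration of the load that issues in parallel with add iteration add_iter. */
int load_iteration(struct pipe_depth d, int add_iter)
{
    assert(add_iter >= 1);
    return add_iter + d.mul_cycles + d.load_cycles;
}
```

With add_iter set to 50, a depth of {5, 2} reproduces the fixed-point figures (52 and 57) and a depth of {5, 4} reproduces the floating-point figures (54 and 59).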


Fixed-Point Example 


To eliminate LDWs in the fixed-point dot product from iterations 51 through 57,
run the loop seven fewer times. This brings the loop counter to 43 (50 - 7),
which means you still must execute seven more cycles of ADD instructions
and five more cycles of MPY instructions. Five pairs of MPYs and seven pairs
of ADDs are now outside the loop. The LDWs, MPYs, and ADDs all execute
exactly 50 times. (The shaded areas of Example 6-24 indicate the changes
in this code.)


Executing the dot product code in Example 6-24 with no extraneous LDWs
still requires a total of 58 cycles (7 + 43 + 7 + 1), but the code size is now
larger.


Floating-Point Example 


To eliminate LDDWs in the floating-point dot product from iterations 51 through
59, run the loop nine fewer times. This brings the loop counter to 41 (50 - 9),
which means you still must execute nine more cycles of ADDSP instructions
and five more cycles of MPYSP instructions. Five pairs of MPYSPs and nine
pairs of ADDSPs are now outside the loop. The LDDWs, MPYSPs, and
ADDSPs all execute exactly 50 times. (The shaded areas of Example 6-25
indicate the changes in this code.)


Executing the dot product code in Example 6-25 with no extraneous LDDWs
still requires a total of 74 cycles (9 + 41 + 9 + 15), but the code size is now
larger.
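

The loop counter and the number of instruction pairs moved outside the loop can be derived from the same two latencies. The following C sketch is an illustration under that assumption; plan_epilog and its field names are hypothetical and do not appear in the original examples.

```c
#include <assert.h>

/* Hypothetical summary of a loop trimmed so that no load runs past the data. */
struct epilog_plan {
    int loop_count;   /* value loaded into the loop counter */
    int extra_muls;   /* multiply pairs executed after the loop */
    int extra_adds;   /* add pairs executed after the loop */
};

struct epilog_plan plan_epilog(int trip_count, int load_cycles, int mul_cycles)
{
    struct epilog_plan p;
    int depth = load_cycles + mul_cycles;   /* how far the loads run ahead of the adds */

    assert(trip_count > depth);
    p.loop_count = trip_count - depth;      /* run the loop this many fewer times */
    p.extra_muls = load_cycles;             /* multiplies still owed for loaded data */
    p.extra_adds = depth;                   /* adds still owed for loaded data */
    return p;
}
```

For 50 loop iterations, plan_epilog(50, 5, 2) gives a counter of 43 with 5 multiply pairs and 7 add pairs after the loop, and plan_epilog(50, 5, 4) gives 41, 5, and 9, matching the fixed-point and floating-point descriptions above.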


Example 6-24. Assembly Code for Fixed-Point Dot Product (Software Pipelined
With No Extraneous Loads)


        LDW    .D1   *A4++,A2        ; load ai & ai+1 from memory
||      LDW    .D2   *B4++,B2        ; load bi & bi+1 from memory
||      MVK    .S1   43,A1           ; set up loop counter
||      ZERO   .L1   A7              ; zero out sum0 accumulator
||      ZERO   .L2   B7              ; zero out sum1 accumulator

   [A1] SUB    .S1   A1,1,A1         ; decrement loop counter
||      LDW    .D1   *A4++,A2        ;* load ai & ai+1 from memory
||      LDW    .D2   *B4++,B2        ;* load bi & bi+1 from memory

   [A1] SUB    .S1   A1,1,A1         ;* decrement loop counter
|| [A1] B      .S2   LOOP            ; branch to loop
||      LDW    .D1   *A4++,A2        ;** load ai & ai+1 from memory
||      LDW    .D2   *B4++,B2        ;** load bi & bi+1 from memory

   [A1] SUB    .S1   A1,1,A1         ;** decrement loop counter
|| [A1] B      .S2   LOOP            ;* branch to loop
||      LDW    .D1   *A4++,A2        ;*** load ai & ai+1 from memory
||      LDW    .D2   *B4++,B2        ;*** load bi & bi+1 from memory

   [A1] SUB    .S1   A1,1,A1         ;*** decrement loop counter
|| [A1] B      .S2   LOOP            ;** branch to loop
||      LDW    .D1   *A4++,A2        ;**** load ai & ai+1 from memory
||      LDW    .D2   *B4++,B2        ;**** load bi & bi+1 from memory

        MPY    .M1X  A2,B2,A6        ; ai * bi
||      MPYH   .M2X  A2,B2,B6        ; ai+1 * bi+1
|| [A1] SUB    .S1   A1,1,A1         ;**** decrement loop counter
|| [A1] B      .S2   LOOP            ;*** branch to loop
||      LDW    .D1   *A4++,A2        ;***** ld ai & ai+1 fm memory
||      LDW    .D2   *B4++,B2        ;***** ld bi & bi+1 fm memory

        MPY    .M1X  A2,B2,A6        ;* ai * bi
||      MPYH   .M2X  A2,B2,B6        ;* ai+1 * bi+1
|| [A1] SUB    .S1   A1,1,A1         ;***** decrement loop counter
|| [A1] B      .S2   LOOP            ;**** branch to loop
||      LDW    .D1   *A4++,A2        ;****** ld ai & ai+1 fm memory
||      LDW    .D2   *B4++,B2        ;****** ld bi & bi+1 fm memory

LOOP:
        ADD    .L1   A6,A7,A7        ; sum0 += (ai * bi)
||      ADD    .L2   B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY    .M1X  A2,B2,A6        ;** ai * bi
||      MPYH   .M2X  A2,B2,B6        ;** ai+1 * bi+1
|| [A1] SUB    .S1   A1,1,A1         ;****** decrement loop counter
|| [A1] B      .S2   LOOP            ;***** branch to loop
||      LDW    .D1   *A4++,A2        ;******* ld ai & ai+1 fm memory
||      LDW    .D2   *B4++,B2        ;******* ld bi & bi+1 fm memory
        ; Branch occurs here

        ADD    .L1   A6,A7,A7        ; sum0 += (ai * bi)
||      ADD    .L2   B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY    .M1X  A2,B2,A6        ; ai * bi
||      MPYH   .M2X  A2,B2,B6        ; ai+1 * bi+1

        ADD    .L1   A6,A7,A7        ; sum0 += (ai * bi)
||      ADD    .L2   B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY    .M1X  A2,B2,A6        ; ai * bi
||      MPYH   .M2X  A2,B2,B6        ; ai+1 * bi+1

        ADD    .L1   A6,A7,A7        ; sum0 += (ai * bi)
||      ADD    .L2   B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY    .M1X  A2,B2,A6        ; ai * bi
||      MPYH   .M2X  A2,B2,B6        ; ai+1 * bi+1

        ADD    .L1   A6,A7,A7        ; sum0 += (ai * bi)
||      ADD    .L2   B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY    .M1X  A2,B2,A6        ; ai * bi
||      MPYH   .M2X  A2,B2,B6        ; ai+1 * bi+1

        ADD    .L1   A6,A7,A7        ; sum0 += (ai * bi)
||      ADD    .L2   B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY    .M1X  A2,B2,A6        ; ai * bi
||      MPYH   .M2X  A2,B2,B6        ; ai+1 * bi+1

        ADD    .L1   A6,A7,A7        ; sum0 += (ai * bi)
||      ADD    .L2   B6,B7,B7        ; sum1 += (ai+1 * bi+1)

        ADD    .L1   A6,A7,A7        ; sum0 += (ai * bi)
||      ADD    .L2   B6,B7,B7        ; sum1 += (ai+1 * bi+1)

        ADD    .L1X  A7,B7,A4        ; sum = sum0 + sum1
Example 6-25. Assembly Code for Floating-Point Dot Product (Software Pipelined
With No Extraneous Loads)


        MVK    .S1   41,A1           ; set up loop counter
||      ZERO   .L1   A8              ; sum0 = 0
||      ZERO   .L2   B8              ; sum1 = 0
||      LDDW   .D1   A4++,A7:A6      ; load ai & ai + 1 from memory
||      LDDW   .D2   B4++,B7:B6      ; load bi & bi + 1 from memory

        LDDW   .D1   A4++,A7:A6      ;* load ai & ai + 1 from memory
||      LDDW   .D2   B4++,B7:B6      ;* load bi & bi + 1 from memory

        LDDW   .D1   A4++,A7:A6      ;** load ai & ai + 1 from memory
||      LDDW   .D2   B4++,B7:B6      ;** load bi & bi + 1 from memory

        LDDW   .D1   A4++,A7:A6      ;*** load ai & ai + 1 from memory
||      LDDW   .D2   B4++,B7:B6      ;*** load bi & bi + 1 from memory
|| [A1] SUB    .S1   A1,1,A1         ; decrement loop counter

        LDDW   .D1   A4++,A7:A6      ;**** load ai & ai + 1 from memory
||      LDDW   .D2   B4++,B7:B6      ;**** load bi & bi + 1 from memory
|| [A1] B      .S2   LOOP            ; branch to loop
|| [A1] SUB    .S1   A1,1,A1         ;* decrement loop counter

        LDDW   .D1   A4++,A7:A6      ;***** load ai & ai + 1 from memory
||      LDDW   .D2   B4++,B7:B6      ;***** load bi & bi + 1 from memory
||      MPYSP  .M1X  A6,B6,A5        ; pi = a0 * b0
||      MPYSP  .M2X  A7,B7,B5        ; pi1 = a1 * b1
|| [A1] B      .S2   LOOP            ;* branch to loop
|| [A1] SUB    .S1   A1,1,A1         ;** decrement loop counter

        LDDW   .D1   A4++,A7:A6      ;****** load ai & ai + 1 from memory
||      LDDW   .D2   B4++,B7:B6      ;****** load bi & bi + 1 from memory
||      MPYSP  .M1X  A6,B6,A5        ;* pi = a0 * b0
||      MPYSP  .M2X  A7,B7,B5        ;* pi1 = a1 * b1
|| [A1] B      .S2   LOOP            ;** branch to loop
|| [A1] SUB    .S1   A1,1,A1         ;*** decrement loop counter

        LDDW   .D1   A4++,A7:A6      ;******* load ai & ai + 1 from memory
||      LDDW   .D2   B4++,B7:B6      ;******* load bi & bi + 1 from memory
||      MPYSP  .M1X  A6,B6,A5        ;** pi = a0 * b0
||      MPYSP  .M2X  A7,B7,B5        ;** pi1 = a1 * b1
|| [A1] B      .S2   LOOP            ;*** branch to loop
|| [A1] SUB    .S1   A1,1,A1         ;**** decrement loop counter

        LDDW   .D1   A4++,A7:A6      ;******** load ai & ai + 1 from memory
||      LDDW   .D2   B4++,B7:B6      ;******** load bi & bi + 1 from memory
||      MPYSP  .M1X  A6,B6,A5        ;*** pi = a0 * b0
||      MPYSP  .M2X  A7,B7,B5        ;*** pi1 = a1 * b1
|| [A1] B      .S2   LOOP            ;**** branch to loop
|| [A1] SUB    .S1   A1,1,A1         ;***** decrement loop counter

LOOP:
        LDDW   .D1   A4++,A7:A6      ;********* load ai & ai + 1 from memory
||      LDDW   .D2   B4++,B7:B6      ;********* load bi & bi + 1 from memory
||      MPYSP  .M1X  A6,B6,A5        ;**** pi = a0 * b0
||      MPYSP  .M2X  A7,B7,B5        ;**** pi1 = a1 * b1
||      ADDSP  .L1   A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP  .L2   B5,B8,B8        ; sum1 += (ai+1 * bi+1)
|| [A1] B      .S2   LOOP            ;***** branch to loop
|| [A1] SUB    .S1   A1,1,A1         ;****** decrement loop counter
        ; Branch occurs here

        MPYSP  .M1X  A6,B6,A5        ; pi = a0 * b0
||      MPYSP  .M2X  A7,B7,B5        ; pi1 = a1 * b1
||      ADDSP  .L1   A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP  .L2   B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        MPYSP  .M1X  A6,B6,A5        ; pi = a0 * b0
||      MPYSP  .M2X  A7,B7,B5        ; pi1 = a1 * b1
||      ADDSP  .L1   A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP  .L2   B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        MPYSP  .M1X  A6,B6,A5        ; pi = a0 * b0
||      MPYSP  .M2X  A7,B7,B5        ; pi1 = a1 * b1
||      ADDSP  .L1   A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP  .L2   B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        MPYSP  .M1X  A6,B6,A5        ; pi = a0 * b0
||      MPYSP  .M2X  A7,B7,B5        ; pi1 = a1 * b1
||      ADDSP  .L1   A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP  .L2   B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        MPYSP  .M1X  A6,B6,A5        ; pi = a0 * b0
||      MPYSP  .M2X  A7,B7,B5        ; pi1 = a1 * b1
||      ADDSP  .L1   A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP  .L2   B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        ADDSP  .L1   A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP  .L2   B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        ADDSP  .L1   A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP  .L2   B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        ADDSP  .L1   A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP  .L2   B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        ADDSP  .L1   A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP  .L2   B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        ADDSP  .L1X  A8,B8,A0        ; sum(0) = sum0(0) + sum1(0)
        ADDSP  .L2X  A8,B8,B0        ; sum(1) = sum0(1) + sum1(1)
        ADDSP  .L1X  A8,B8,A0        ; sum(2) = sum0(2) + sum1(2)
        ADDSP  .L2X  A8,B8,B0        ; sum(3) = sum0(3) + sum1(3)
        NOP                          ; wait for B0
        ADDSP  .L1X  A0,B0,A5        ; sum(01) = sum(0) + sum(1)
        NOP                          ; wait for next B0
        ADDSP  .L2X  A0,B0,B5        ; sum(23) = sum(2) + sum(3)
        NOP    3
        ADDSP  .L1X  A5,B5,A4        ; sum = sum(01) + sum(23)
        NOP    3
6.4.3.4 Priming the Loop


Although Example 6-24 and Example 6-25 execute as fast as possible, the
code size can be smaller without significantly sacrificing performance. To help
reduce code size, you can use a technique called priming the loop. Assuming
that you can handle extraneous loads, start with Example 6-22 or
Example 6-23, which do not have epilogs and, therefore, contain fewer
instructions. (This technique can be used equally well with Example 6-24 or
Example 6-25.)


Fixed-Point Example


To eliminate the prolog of the fixed-point dot product and, therefore, the extra
LDW and MPY instructions, begin execution at the loop body (at the LOOP
label). Eliminating the prolog means that:


-  Two LDWs, two MPYs, and two ADDs occur in the first execution cycle of
   the loop.


-  Because the first LDWs require five cycles to write results into a register,
   the MPYs do not multiply valid data until after the loop executes five times.
   The ADDs have no valid data until after seven cycles (five cycles for the
   first LDWs and two more cycles for the first valid MPYs).


Example 6-26 shows the loop without the prolog but with four new instructions
that zero the inputs to the MPY and ADD instructions. Making the MPYs and
ADDs use 0s before valid data is available ensures that the final accumulator
values are unaffected. (The loop counter is initialized to 57 to accommodate
the seven extra cycles needed to prime the loop.)


Because the first LDWs are not issued until after seven cycles, the code in
Example 6-26 requires a total of 65 cycles (7 + 57 + 1). Therefore, you are
reducing the code size with a slight loss in performance.
Example 6-26. Assembly Code for Fixed-Point Dot Product (Software Pipelined With
Removal of Prolog and Epilog)


        MVK    .S1   57,A1           ; set up loop counter

   [A1] SUB    .S1   A1,1,A1         ; decrement loop counter
||      ZERO   .L1   A7              ; zero out sum0 accumulator
||      ZERO   .L2   B7              ; zero out sum1 accumulator

   [A1] SUB    .S1   A1,1,A1         ;* decrement loop counter
|| [A1] B      .S2   LOOP            ; branch to loop
||      ZERO   .L1   A6              ; zero out add input
||      ZERO   .L2   B6              ; zero out add input

   [A1] SUB    .S1   A1,1,A1         ;** decrement loop counter
|| [A1] B      .S2   LOOP            ;* branch to loop
||      ZERO   .L1   A2              ; zero out mpy input
||      ZERO   .L2   B2              ; zero out mpy input

   [A1] SUB    .S1   A1,1,A1         ;*** decrement loop counter
|| [A1] B      .S2   LOOP            ;** branch to loop

   [A1] SUB    .S1   A1,1,A1         ;**** decrement loop counter
|| [A1] B      .S2   LOOP            ;*** branch to loop

   [A1] SUB    .S1   A1,1,A1         ;***** decrement loop counter
|| [A1] B      .S2   LOOP            ;**** branch to loop

LOOP:
        ADD    .L1   A6,A7,A7        ; sum0 += (ai * bi)
||      ADD    .L2   B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY    .M1X  A2,B2,A6        ;** ai * bi
||      MPYH   .M2X  A2,B2,B6        ;** ai+1 * bi+1
|| [A1] SUB    .S1   A1,1,A1         ;****** decrement loop counter
|| [A1] B      .S2   LOOP            ;***** branch to loop
||      LDW    .D1   *A4++,A2        ;******* ld ai & ai+1 fm memory
||      LDW    .D2   *B4++,B2        ;******* ld bi & bi+1 fm memory
        ; Branch occurs here

        ADD    .L1X  A7,B7,A4        ; sum = sum0 + sum1
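

To see why zeroing the multiply and add inputs leaves the result unchanged, it can help to step through the primed loop cycle by cycle. The C sketch below is a simplified register-level model written for this explanation, not TI code: the function name, the POISON value standing in for whatever an extraneous load returns, and the single-stage multiply delay are all assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define LOAD_LATENCY 5      /* LDW data usable five cycles after issue */
#define MPY_LATENCY  2      /* MPY product usable two cycles after issue */
#define POISON       0x7fff /* assumed stand-in for data from an extraneous load */

struct pair { int16_t lo, hi; };   /* ai and ai+1 packed in one 32-bit word */

/* Hypothetical model of the primed fixed-point loop; n_pairs is 50 in the text. */
int32_t primed_dot_model(const int16_t *a, const int16_t *b, int n_pairs)
{
    const struct pair poison = { POISON, POISON };
    struct pair a2 = {0, 0}, b2 = {0, 0};          /* mpy inputs, zeroed */
    struct pair ld_a[LOAD_LATENCY - 1] = {{0, 0}}; /* loads still in flight */
    struct pair ld_b[LOAD_LATENCY - 1] = {{0, 0}};
    int32_t a6 = 0, b6 = 0;                        /* add inputs, zeroed */
    int32_t mpy_a = 0, mpy_b = 0;                  /* product in flight (one stage) */
    int32_t a7 = 0, b7 = 0;                        /* sum0 and sum1, zeroed */
    int count = n_pairs + LOAD_LATENCY + MPY_LATENCY;  /* 57 when n_pairs is 50 */
    int t, s;

    assert(n_pairs >= 0);
    for (t = 0; t < count; t++) {
        /* Every kernel instruction reads its operands at the start of the cycle. */
        int32_t add_a = a7 + a6;
        int32_t add_b = b7 + b6;
        int32_t prod_a = (int32_t)a2.lo * b2.lo;   /* MPY:  low halves  */
        int32_t prod_b = (int32_t)a2.hi * b2.hi;   /* MPYH: high halves */
        struct pair new_a = poison, new_b = poison;
        if (t < n_pairs) {                         /* loads past the data are extraneous */
            new_a.lo = a[2 * t]; new_a.hi = a[2 * t + 1];
            new_b.lo = b[2 * t]; new_b.hi = b[2 * t + 1];
        }
        /* Results land in their destination registers at the end of the cycle. */
        a7 = add_a;  b7 = add_b;
        a6 = mpy_a;  b6 = mpy_b;
        mpy_a = prod_a;  mpy_b = prod_b;
        a2 = ld_a[0];  b2 = ld_b[0];
        for (s = 0; s + 1 < LOAD_LATENCY - 1; s++) {
            ld_a[s] = ld_a[s + 1];
            ld_b[s] = ld_b[s + 1];
        }
        ld_a[LOAD_LATENCY - 2] = new_a;
        ld_b[LOAD_LATENCY - 2] = new_b;
    }
    return a7 + b7;                                /* sum = sum0 + sum1 */
}
```

In this model the first seven additions contribute only zeros, and products formed from the poisoned loads are still in flight when the loop ends, so they never reach the accumulators.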


Floating-Point Example 


To eliminate the prolog of the floating-point dot product and, therefore, the 
extra LDDW and MPYSP instructions, begin execution at the loop body (at the 
LOOP label). Eliminating the prolog means that: 


-  Two LDDWs, two MPYSPs, and two ADDSPs occur in the first execution
   cycle of the loop.


-  Because the first LDDWs require five cycles to write results into a register,
   the MPYSPs do not multiply valid data until after the loop executes five
   times. The ADDSPs have no valid data until after nine cycles (five cycles
   for the first LDDWs and four more cycles for the first valid MPYSPs).


Example 6-27 shows the loop without the prolog but with four new instructions
that zero the inputs to the MPYSP and ADDSP instructions. Making the
MPYSPs and ADDSPs use 0s before valid data is available ensures that the
final accumulator values are unaffected. (The loop counter is initialized to 59
to accommodate the nine extra cycles needed to prime the loop.)


Because the first LDDWs are not issued until after seven cycles, the code in
Example 6-27 requires a total of 81 cycles (7 + 59 + 15). Therefore, you are
reducing the code size with a slight loss in performance.


Example 6-27. Assembly Code for Floating-Point Dot Product (Software Pipelined With
Removal of Prolog and Epilog)


        MVK    .S1   59,A1           ; set up loop counter
||      ZERO   .L1   A7              ; zero out mpysp input
||      ZERO   .L2   B7              ; zero out mpysp input

   [A1] SUB    .S1   A1,1,A1         ; decrement loop counter

   [A1] B      .S2   LOOP            ; branch to loop
|| [A1] SUB    .S1   A1,1,A1         ;* decrement loop counter
||      ZERO   .L1   A8              ; zero out sum0 accumulator
||      ZERO   .L2   B8              ; zero out sum1 accumulator

   [A1] B      .S2   LOOP            ;* branch to loop
|| [A1] SUB    .S1   A1,1,A1         ;** decrement loop counter
||      ZERO   .L1   A5              ; zero out addsp input
||      ZERO   .L2   B5              ; zero out addsp input

   [A1] B      .S2   LOOP            ;** branch to loop
|| [A1] SUB    .S1   A1,1,A1         ;*** decrement loop counter
||      ZERO   .L1   A6              ; zero out mpysp input
||      ZERO   .L2   B6              ; zero out mpysp input

   [A1] B      .S2   LOOP            ;*** branch to loop
|| [A1] SUB    .S1   A1,1,A1         ;**** decrement loop counter

   [A1] B      .S2   LOOP            ;**** branch to loop
|| [A1] SUB    .S1   A1,1,A1         ;***** decrement loop counter

LOOP:
        LDDW   .D1   A4++,A7:A6      ;********* load ai & ai + 1 from memory
||      LDDW   .D2   B4++,B7:B6      ;********* load bi & bi + 1 from memory
||      MPYSP  .M1X  A6,B6,A5        ;**** pi = a0 * b0
||      MPYSP  .M2X  A7,B7,B5        ;**** pi1 = a1 * b1
||      ADDSP  .L1   A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP  .L2   B5,B8,B8        ; sum1 += (ai+1 * bi+1)
|| [A1] B      .S2   LOOP            ;***** branch to loop
|| [A1] SUB    .S1   A1,1,A1         ;****** decrement loop counter
        ; Branch occurs here

        ADDSP  .L1X  A8,B8,A0        ; sum(0) = sum0(0) + sum1(0)
        ADDSP  .L2X  A8,B8,B0        ; sum(1) = sum0(1) + sum1(1)
        ADDSP  .L1X  A8,B8,A0        ; sum(2) = sum0(2) + sum1(2)
        ADDSP  .L2X  A8,B8,B0        ; sum(3) = sum0(3) + sum1(3)
        NOP                          ; wait for B0
        ADDSP  .L1X  A0,B0,A5        ; sum(01) = sum(0) + sum(1)
        NOP                          ; wait for next B0
        ADDSP  .L2X  A0,B0,B5        ; sum(23) = sum(2) + sum(3)
        NOP    3
        ADDSP  .L1X  A5,B5,A4        ; sum = sum(01) + sum(23)
        NOP    3
6.4.3.5 Removing Extra SUB Instructions


To reduce code size further, you can remove extra SUB instructions. If you
know that the loop count is at least 6, you can eliminate the extra SUB
instructions as shown in Example 6-28 and Example 6-29. The first five branch
instructions are made unconditional, because they always execute. (If you do
not know that the loop count is at least 6, you must keep the SUB instructions
that decrement before each conditional branch as in Example 6-26 and
Example 6-27.) Based on the elimination of six SUB instructions, the loop
counter is now 51 (57 - 6) for the fixed-point dot product and 53 (59 - 6) for
the floating-point dot product. This code shows some improvement over
Example 6-26 and Example 6-27. The loop in Example 6-28 requires 63
cycles (5 + 57 + 1) and the loop in Example 6-29 requires 79 cycles
(5 + 59 + 15).


Example 6-28. Assembly Code for Fixed-Point Dot Product (Software Pipelined
With Smallest Code Size)


        B      .S2   LOOP            ; branch to loop
||      MVK    .S1   51,A1           ; set up loop counter

        B      .S2   LOOP            ;* branch to loop

        B      .S2   LOOP            ;** branch to loop
||      ZERO   .L1   A7              ; zero out sum0 accumulator
||      ZERO   .L2   B7              ; zero out sum1 accumulator

        B      .S2   LOOP            ;*** branch to loop
||      ZERO   .L1   A6              ; zero out add input
||      ZERO   .L2   B6              ; zero out add input

        B      .S2   LOOP            ;**** branch to loop
||      ZERO   .L1   A2              ; zero out mpy input
||      ZERO   .L2   B2              ; zero out mpy input

LOOP:
        ADD    .L1   A6,A7,A7        ; sum0 += (ai * bi)
||      ADD    .L2   B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY    .M1X  A2,B2,A6        ;** ai * bi
||      MPYH   .M2X  A2,B2,B6        ;** ai+1 * bi+1
|| [A1] SUB    .S1   A1,1,A1         ;****** decrement loop counter
|| [A1] B      .S2   LOOP            ;***** branch to loop
||      LDW    .D1   *A4++,A2        ;******* ld ai & ai+1 fm memory
||      LDW    .D2   *B4++,B2        ;******* ld bi & bi+1 fm memory
        ; Branch occurs here

        ADD    .L1X  A7,B7,A4        ; sum = sum0 + sum1
Example 6-29. Assembly Code for Floating-Point Dot Product (Software Pipelined
With Smallest Code Size)


        B      .S2   LOOP            ; branch to loop
||      MVK    .S1   53,A1           ; set up loop counter

        B      .S2   LOOP            ;* branch to loop
||      ZERO   .L1   A7              ; zero out mpysp input
||      ZERO   .L2   B7              ; zero out mpysp input

        B      .S2   LOOP            ;** branch to loop
||      ZERO   .L1   A8              ; zero out sum0 accumulator
||      ZERO   .L2   B8              ; zero out sum1 accumulator

        B      .S2   LOOP            ;*** branch to loop
||      ZERO   .L1   A5              ; zero out addsp input
||      ZERO   .L2   B5              ; zero out addsp input

        B      .S2   LOOP            ;**** branch to loop
||      ZERO   .L1   A6              ; zero out mpysp input
||      ZERO   .L2   B6              ; zero out mpysp input

LOOP:
        LDDW   .D1   A4++,A7:A6      ;********* load ai & ai + 1 from memory
||      LDDW   .D2   B4++,B7:B6      ;********* load bi & bi + 1 from memory
||      MPYSP  .M1X  A6,B6,A5        ;**** pi = a0 * b0
||      MPYSP  .M2X  A7,B7,B5        ;**** pi1 = a1 * b1
||      ADDSP  .L1   A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP  .L2   B5,B8,B8        ; sum1 += (ai+1 * bi+1)
|| [A1] B      .S2   LOOP            ;***** branch to loop
|| [A1] SUB    .S1   A1,1,A1         ;****** decrement loop counter
        ; Branch occurs here

        ADDSP  .L1X  A8,B8,A0        ; sum(0) = sum0(0) + sum1(0)
        ADDSP  .L2X  A8,B8,B0        ; sum(1) = sum0(1) + sum1(1)
        ADDSP  .L1X  A8,B8,A0        ; sum(2) = sum0(2) + sum1(2)
        ADDSP  .L2X  A8,B8,B0        ; sum(3) = sum0(3) + sum1(3)
        NOP                          ; wait for B0
        ADDSP  .L1X  A0,B0,A5        ; sum(01) = sum(0) + sum(1)
        NOP                          ; wait for next B0
        ADDSP  .L2X  A0,B0,B5        ; sum(23) = sum(2) + sum(3)
        NOP    3
        ADDSP  .L1X  A5,B5,A4        ; sum = sum(01) + sum(23)
        NOP    3


6.4.4 Comparing Performance


Table 6-10 compares the performance of all versions of the fixed-point dot
product code. Table 6-11 compares the performance of all versions of the
floating-point dot product code.


Table 6-10. Comparison of Fixed-Point Dot Product Code Examples

Code Example    Description                                     100 Iterations      Cycle Count
Example 6-5     Fixed-point dot product linear assembly         2 + 100 x 16        1602
Example 6-6     Fixed-point dot product parallel assembly       1 + 100 x 8         801
Example 6-15    Fixed-point dot product parallel assembly       1 + (50 x 8) + 1    402
                with LDW
Example 6-22    Fixed-point software-pipelined dot product      7 + 50 + 1          58
Example 6-24    Fixed-point software-pipelined dot product      7 + 43 + 7 + 1      58
                with no extraneous loads
Example 6-26    Fixed-point software-pipelined dot product      7 + 57 + 1          65
                with no prolog or epilog
Example 6-28    Fixed-point software-pipelined dot product      5 + 57 + 1          63
                with smallest code size


Table 6-11. Comparison of Floating-Point Dot Product Code Examples

Code Example    Description                                     100 Iterations      Cycle Count
Example 6-7     Floating-point dot product nonparallel          2 + 100 x 21        2102
                assembly
Example 6-8     Floating-point dot product parallel assembly    1 + 100 x 10        1001
Example 6-16    Floating-point dot product parallel assembly    1 + (50 x 10) + 7   508
                with LDDW
Example 6-23    Floating-point software-pipelined dot product   9 + 50 + 15         74
Example 6-25    Floating-point software-pipelined dot product   9 + 41 + 9 + 15     74
                with no extraneous loads
Example 6-27    Floating-point software-pipelined dot product   7 + 59 + 15         81
                with no prolog or epilog
Example 6-29    Floating-point software-pipelined dot product   5 + 59 + 15         79
                with smallest code size
6.5 Modulo Scheduling of Multicycle Loops


Section 6.4 demonstrated the modulo-scheduling technique for the dot 
product code. In that example of a single-cycle loop, none of the instructions 
used the same resources. Multicycle loops can present resource conflicts 
which affect modulo scheduling. This section describes techniques to deal 
with this issue. 


6.5.1 Weighted Vector Sum C Code 


Example 6-30 shows the C code for a weighted vector sum. 


Example 6-30. Weighted Vector Sum C Code 


void w_vec(short a[],short b[],short c[],short m)
{
    int i;

    for (i=0; i<100; i++) {
        c[i] = ((m * a[i]) >> 15) + b[i];
    }
}


6.5.2 Translating C Code to Linear Assembly 


Example 6-31 shows the linear assembly that executes the weighted vector 
sum in Example 6-30. This linear assembly does not have functional units as- 
signed. The dependency graph will help in those decisions. However, before 
looking at the dependency graph, the code can be optimized further. 


Example 6-31. Linear Assembly for Weighted Vector Sum Inner Loop 


        LDH    *aptr++,ai            ; ai
        LDH    *bptr++,bi            ; bi
        MPY    m,ai,pi               ; m * ai
        SHR    pi,15,pi_scaled       ; (m * ai) >> 15
        ADD    pi_scaled,bi,ci       ; ci = (m * ai) >> 15 + bi
        STH    ci,*cptr++            ; store ci
 [cntr] SUB    cntr,1,cntr           ; decrement loop counter
 [cntr] B      LOOP                  ; branch to loop
6.5.3 Determining the Minimum Iteration Interval


Example 6-31 includes three memory operations in the inner loop (two LDHs 
and the STH) that must each use a .D unit. Only two .D units are available on 
any single cycle; therefore, this loop requires at least two cycles. Because no 
other resource is used more than twice, the minimum iteration interval for this 
loop is 2. 


Memory operations determine the minimum iteration interval in this example. 
Therefore, before scheduling this assembly code, unroll the loop and perform 
LDWs to help improve the performance. 
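

The resource-based part of this reasoning can be written as a small calculation: for each kind of functional unit, divide the number of instructions that need it by the number of units available, round up, and take the largest result. The C sketch below is an illustration only; the struct and function names are hypothetical, and it covers the resource bound described here, not any limit imposed by loop-carried dependencies.

```c
#include <assert.h>

/* Hypothetical record: instructions per iteration that need a unit type,
   and how many units of that type can issue on one cycle. */
struct unit_usage {
    int uses;
    int available;
};

/* Smallest iteration interval that the listed resources allow. */
int resource_min_ii(const struct unit_usage *u, int n)
{
    int ii = 1;
    int i;

    for (i = 0; i < n; i++) {
        int need;
        assert(u[i].available > 0);
        need = (u[i].uses + u[i].available - 1) / u[i].available;  /* ceiling */
        if (need > ii)
            ii = need;
    }
    return ii;
}
```

For Example 6-31, the three memory operations sharing two .D units give an entry of {3, 2}, which alone forces an iteration interval of 2.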


6.5.3.1 Unrolling the Weighted Vector Sum C Code 


Example 6-32 shows the C code for an unrolled version of the weighted vector 
sum. 


Example 6-32. Weighted Vector Sum C Code (Unrolled) 


void w_vec(short a[],short b[],short c[],short m)
{
    int i;

    for (i=0; i<100; i+=2) {
        c[i]   = ((m * a[i])   >> 15) + b[i];
        c[i+1] = ((m * a[i+1]) >> 15) + b[i+1];
    }
}
6.5.3.2 Translating Unrolled Inner Loop to Linear Assembly


Example 6-33 shows the linear assembly that calculates c[i] and c[i+1] for the
weighted vector sum in Example 6-32.


-  The two store pointers (*ciptr and *ci+1ptr) are separated so that one
   (*ciptr) increments by 2 through the odd elements of the array and the
   other (*ci+1ptr) increments through the even elements.


-  AND and SHR separate bi and bi+1 into two separate registers.


-  This code assumes that mask is preloaded with 0x0000FFFF to clear the
   upper 16 bits. The shift right of 16 places bi+1 into the 16 LSBs.
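

The AND and SHR steps described above can be mirrored in C to show what each register holds. This is a sketch for illustration; the function names are hypothetical, and the sign handling in high_half reproduces the arithmetic behavior of SHR in a portable way rather than relying on a compiler's treatment of signed shifts.

```c
#include <assert.h>
#include <stdint.h>

/* AND bi_i+1,mask,bi : keep the 16 LSBs; the upper half is cleared,
   so bi is not sign-extended. */
uint32_t low_half(uint32_t word)
{
    return word & 0x0000FFFFu;
}

/* SHR bi_i+1,16,bi+1 : arithmetic shift, so bi+1 arrives sign-extended. */
int32_t high_half(uint32_t word)
{
    int32_t hi = (int32_t)(word >> 16);        /* 0..65535 */
    return hi >= 0x8000 ? hi - 0x10000 : hi;   /* reproduce the sign extension */
}

/* STH stores only the 16 LSBs of the sum, so the cleared upper half of bi
   does not change the stored result. */
uint16_t stored_halfword(int32_t pi_scaled, uint32_t bi)
{
    return (uint16_t)((uint32_t)pi_scaled + bi);
}
```

For example, a packed word of 0xFFFE0003 yields bi = 3 and bi+1 = -2, and adding 10 to a masked bi of 0xFFFF stores 9, the same halfword that 10 + (-1) would produce.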


Example 6-33. Linear Assembly for Weighted Vector Sum Using LDW 


        LDW    *aptr++,ai_i+1           ; ai & ai+1
        LDW    *bptr++,bi_i+1           ; bi & bi+1
        MPY    m,ai_i+1,pi              ; m * ai
        MPYHL  m,ai_i+1,pi+1            ; m * ai+1
        SHR    pi,15,pi_scaled          ; (m * ai) >> 15
        SHR    pi+1,15,pi+1_scaled      ; (m * ai+1) >> 15
        AND    bi_i+1,mask,bi           ; bi
        SHR    bi_i+1,16,bi+1           ; bi+1
        ADD    pi_scaled,bi,ci          ; ci = (m * ai) >> 15 + bi
        ADD    pi+1_scaled,bi+1,ci+1    ; ci+1 = (m * ai+1) >> 15 + bi+1
        STH    ci,*ciptr++[2]           ; store ci
        STH    ci+1,*ci+1ptr++[2]       ; store ci+1
 [cntr] SUB    cntr,1,cntr              ; decrement loop counter
 [cntr] B      LOOP                     ; branch to loop



6.5.3.3 Determining a New Minimum Iteration Interval 
Use the following considerations to determine the minimum iteration interval
for the assembly instructions in Example 6-33:


-  Four memory operations (two LDWs and two STHs) must each use a .D
   unit. With two .D units available, this loop still requires only two cycles.


-  Four instructions must use the .S units (three SHRs and one branch). With
   two .S units available, the minimum iteration interval is still 2.


-  The two MPYs do not increase the minimum iteration interval.


-  Because the remaining four instructions (two ADDs, AND, and SUB) can
   all use a .L unit, the minimum iteration interval for this loop is the same as
   in Example 6-31.


By using LDWs instead of LDHs, the program can do twice as much work in
the same number of cycles.


6.5.4 Drawing a Dependency Graph


To achieve a minimum iteration interval of 2, you must put an equal number
of operations per unit on each side of the dependency graph. Three operations
in one unit on a side would result in a minimum iteration interval of 3.


Figure 6-11 shows the dependency graph divided evenly with a minimum
iteration interval of 2.


Figure 6-11. Dependency Graph of Weighted Vector Sum

[Figure not reproduced: dependency graph divided into an A side and a B side.]



6.5.5 Linear Assembly Resource Allocation 


Using the dependency graph, you can allocate functional units and registers 
as shown in Example 6-34. This code is based on the following assumptions: 


-  The pointers are initialized outside the loop.
-  m resides in B6, which causes both .M units to use a cross path.
-  The mask in the AND instruction resides in B10.


Example 6-34. Linear Assembly for Weighted Vector Sum With Resources Allocated


        LDW    .D1   *A4++,A2        ; ai & ai+1
        LDW    .D2   *B4++,B2        ; bi & bi+1
        MPY    .M1X  A2,B6,A5        ; pi = m * ai
        MPYHL  .M2X  A2,B6,B5        ; pi+1 = m * ai+1
        SHR    .S1   A5,15,A7        ; pi_scaled = (m * ai) >> 15
        SHR    .S2   B5,15,B7        ; pi+1_scaled = (m * ai+1) >> 15
        AND    .L2   B2,B10,B8       ; bi
        SHR    .S2   B2,16,B1        ; bi+1
        ADD    .L1X  A7,B8,A9        ; ci = (m * ai) >> 15 + bi
        ADD    .L2   B7,B1,B9        ; ci+1 = (m * ai+1) >> 15 + bi+1
        STH    .D1   A9,*A6++[2]     ; store ci
        STH    .D2   B9,*B0++[2]     ; store ci+1
   [A1] SUB    .L1   A1,1,A1         ; decrement loop counter
   [A1] B      .S1   LOOP            ; branch to loop


6.5.6 Modulo Iteration Interval Scheduling 


Table 6-12 provides a method to keep track of resources that are a modulo
iteration interval away from each other. In the single-cycle dot product
example, every instruction executed every cycle and, therefore, required only one
set of resources. Table 6-12 includes two groups of resources, which are
necessary because you are scheduling a two-cycle loop.


-  Instructions that execute on cycle k also execute on cycle k + 2, k + 4, etc.
   Instructions scheduled on these even cycles cannot use the same
   resources.


-  Instructions that execute on cycle k + 1 also execute on cycle k + 3, k + 5,
   etc. Instructions scheduled on these odd cycles cannot use the same
   resources.


-  Because two instructions (MPY and ADD) use the 1X path but do not use
   the same functional unit, Table 6-12 includes two rows (1X and 2X) that
   help you keep track of the cross path resources.


Only seven instructions have been scheduled in this table.


-  The two LDWs use the .D units on the even cycles.


-  The MPY and MPYHL are scheduled on cycle 5 because the LDW has four
   delay slots. The MPY instructions appear in two rows because they use
   the .M and cross path resources on cycles 5, 7, 9, etc.


-  The two SHR instructions are scheduled two cycles after the MPY to allow
   for the MPY's single delay slot.


-  The AND is scheduled on cycle 5, four delay slots after the LDW.

Table 6-12. Modulo Iteration Interval Table for Weighted Vector Sum (2-Cycle Loop)

Unit/Cycle  0                2                4                6                8                10
.D1         LDW ai_i+1       * LDW ai_i+1     ** LDW ai_i+1    *** LDW ai_i+1   **** LDW ai_i+1  ***** LDW ai_i+1
.D2         LDW bi_i+1       * LDW bi_i+1     ** LDW bi_i+1    *** LDW bi_i+1   **** LDW bi_i+1  ***** LDW bi_i+1
.M1
.M2
.L1
.L2
.S1
.S2
1X
2X

Unit/Cycle  1                3                5                7                9                11
.D1
.D2
.M1                                           MPY pi           * MPY pi         ** MPY pi        *** MPY pi
.M2                                           MPYHL pi+1       * MPYHL pi+1     ** MPYHL pi+1    *** MPYHL pi+1
.L1
.L2                                           AND bi           * AND bi         ** AND bi        *** AND bi
.S1                                                            SHR pi_s         * SHR pi_s       ** SHR pi_s
.S2                                                            SHR pi+1_s       * SHR pi+1_s     ** SHR pi+1_s
1X                                            MPY pi           * MPY pi         ** MPY pi        *** MPY pi
2X                                            MPYHL pi+1       * MPYHL pi+1     ** MPYHL pi+1    *** MPYHL pi+1


Note: The asterisks indicate the iteration of the loop.
6.5.6.1 Resource Conflicts


Resources from one instruction cannot conflict with resources from any other
instruction scheduled modulo iteration intervals away. In other words, for a
2-cycle loop, instructions scheduled on cycle n cannot use the same resources
as instructions scheduled on cycles n + 2, n + 4, n + 6, etc. Table 6-13 shows
the addition of the SHR bi+1 instruction. This must avoid a conflict of resources
in cycles 5 and 7, which are one iteration interval away from each other.


Even though LDW bi_i+1 (.D2, cycle 0) finishes on cycle 5, its child, SHR bi+1,
cannot be scheduled on .S2 until cycle 6 because of a resource conflict with
SHR pi+1_scaled, which is on .S2 in cycle 7.
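

The conflict rule above amounts to comparing cycle numbers modulo the iteration interval for instructions that share a unit. A minimal C sketch follows; the struct slot type and the function name are hypothetical, and the unit is identified by a plain string for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical placement of one instruction: the unit it uses and its cycle. */
struct slot {
    const char *unit;
    int cycle;
};

/* True when two placements collide in a loop with iteration interval ii. */
bool modulo_conflict(struct slot x, struct slot y, int ii)
{
    assert(ii > 0 && x.cycle >= 0 && y.cycle >= 0);
    return strcmp(x.unit, y.unit) == 0 && (x.cycle % ii) == (y.cycle % ii);
}
```

With SHR pi+1_scaled placed on .S2 in cycle 7 and an iteration interval of 2, placing SHR bi+1 on .S2 in cycle 5 collides, while placing it in cycle 6 does not.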


Figure 6-12. Dependency Graph of Weighted Vector Sum (Showing Resource Conflict)

[Figure not reproduced.]


Table 6-13. Modulo Iteration Interval Table for Weighted Vector Sum With SHR
Instructions

Unit/Cycle  0                2                4                6                8                10
.D1         LDW ai_i+1       * LDW ai_i+1     ** LDW ai_i+1    *** LDW ai_i+1   **** LDW ai_i+1  ***** LDW ai_i+1
.D2         LDW bi_i+1       * LDW bi_i+1     ** LDW bi_i+1    *** LDW bi_i+1   **** LDW bi_i+1  ***** LDW bi_i+1
.M1
.M2
.L1
.L2
.S1
.S2                                                            SHR bi+1         * SHR bi+1       ** SHR bi+1
1X
2X

Unit/Cycle  1                3                5                7                9                11
.D1
.D2
.M1                                           MPY pi           * MPY pi         ** MPY pi        *** MPY pi
.M2                                           MPYHL pi+1       * MPYHL pi+1     ** MPYHL pi+1    *** MPYHL pi+1
.L1
.L2                                           AND bi           * AND bi         ** AND bi        *** AND bi
.S1                                                            SHR pi_s         * SHR pi_s       ** SHR pi_s
.S2                                                            SHR pi+1_s       * SHR pi+1_s     ** SHR pi+1_s
1X                                            MPY pi           * MPY pi         ** MPY pi        *** MPY pi
2X                                            MPYHL pi+1       * MPYHL pi+1     ** MPYHL pi+1    *** MPYHL pi+1


Note: The asterisks indicate the iteration of the loop. The SHR bi+1 entries on .S2 are the change in scheduling from Table 6-12.
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6.5.6.2 Live Too Long 


Scheduling SHR bi+1 on cycle 6 now creates a problem with scheduling the 
ADD ci instruction. The parents of ADD ci (AND bi and SHR pi_scaled) are 
scheduled on cycles 5 and 7, respectively. Because the SHR pi_scaled is 
scheduled on cycle 7, the earliest you can schedule ADD ci is cycle 8. 


However, in cycle 7, AND bi * writes bi for the next iteration of the loop, which 
creates a scheduling problem with the ADD ci instruction. If you schedule 
ADD ci on cycle 8, the ADD instruction reads the parent value of bi for the next
iteration, which is incorrect. The ADD ci demonstrates a live-too-long problem. 


No value can be live in a register for more than the number of cycles in the loop. 
Otherwise, iteration n + 1 writes into the register before iteration n has read that 
register. Therefore, in a 2-cycle loop, if a value is written to a register at the end
of cycle n, then all children of that value must read the register before the end 
of cycle n + 2. 


6.5.6.3 Solving the Live-Too-Long Problem


The live-too-long problem in Table 6-13 means that the bi value would have
to be live from cycles 6-8, or 3 cycles. No loop variable can live longer than 
the iteration interval, because a child would then read the parent value for the 
next iteration. 


To solve this problem, move AND bi to cycle 6 so that you can schedule ADD ci
to read the correct value on cycle 8, as shown in Figure 6-13 and Table 6-14.
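
The live-too-long rule reduces to one comparison. The C sketch below is a minimal illustration, not a tool feature: live_too_long is a hypothetical name, and it assumes a single-cycle parent whose result is written at the end of write_cycle and read by its last child during read_cycle.

```c
#include <stdbool.h>

/* Hypothetical check: a value written at the end of write_cycle and read by
 * its last child on read_cycle is live too long when that span exceeds the
 * iteration interval (ii), because the next iteration's write lands first. */
bool live_too_long(int write_cycle, int read_cycle, int ii)
{
    return (read_cycle - write_cycle) > ii;
}
```

For the 2-cycle loop, AND bi on cycle 5 with ADD ci on cycle 8 gives a span of 3 and fails the check; moving AND bi to cycle 6 gives a span of 2 and passes.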


Figure 6-13. Dependency Graph of Weighted Vector Sum (With Resource Conflict
Resolved)

Note: Shaded numbers indicate the cycle in which the instruction is first scheduled.


Table 6-14. Modulo Iteration Interval Table for Weighted Vector Sum (2-Cycle Loop)

Unit   Instruction    First cycle   Also on cycles    Change from Table 6-13
.D1    LDW ai_i+1     0             2, 4, 6, ...
.D2    LDW bi_i+1     0             2, 4, 6, ...
.L1    ADD ci         8             10, ...           added
.L2    AND bi         6             8, 10, ...        moved from cycle 5
.S2    SHR bi+1       6             8, 10, ...
.M1    MPY pi         5             7, 9, 11, ...
.M2    MPYHL pi+1     5             7, 9, 11, ...
.S1    SHR pi_s       7             9, 11, ...
.S2    SHR pi+1_s     7             9, 11, ...
1X     MPY pi         5             7, 9, 11, ...
2X     MPYHL pi+1     5             7, 9, 11, ...


Note: Each instruction repeats every two cycles, once per loop iteration. The
last column marks the changes in scheduling from Table 6-13.


6.5.6.4 Scheduling the Remaining Instructions


Figure 6-14 shows the dependency graph with additional scheduling
changes. The final version of the loop, with all instructions scheduled correctly,
is shown in Table 6-15.


Figure 6-14. Dependency Graph of Weighted Vector Sum (Scheduling ci+1)

Note: Shaded numbers indicate the cycle in which the instruction is first scheduled.


Table 6-15 shows the following additions:

- B LOOP (.S1, cycle 6)
- SUB cntr (.L1, cycle 5)
- ADD ci+1 (.L2, cycle 10)
- STH ci (cycle 9)
- STH ci+1 (cycle 11)


To avoid resource conflicts and live-too-long problems, Table 6-15 also
includes the following additional changes:

- LDW bi_i+1 (.D2) moved from cycle 0 to cycle 2.
- AND bi (.L2) moved from cycle 6 to cycle 7.
- SHR pi+1_scaled (.S2) moved from cycle 7 to cycle 9.
- MPYHL pi+1 moved from cycle 5 to cycle 6.
- SHR bi+1 moved from cycle 6 to cycle 8.


From the table, you can see that this loop is pipelined six iterations deep,
because iterations n and n + 5 execute in parallel.


Table 6-15. Modulo Iteration Interval Table for Weighted Vector Sum (2-Cycle Loop)

Unit   Instruction    First cycle   Also on cycles    Change from Table 6-14
.D1    LDW ai_i+1     0             2, 4, 6, ...
.D2    LDW bi_i+1     2             4, 6, 8, ...      moved from cycle 0
.L1    SUB cntr       5             7, 9, 11, ...     added
.M1    MPY pi         5             7, 9, 11, ...
.S1    B LOOP         6             8, 10, ...        added
.M2    MPYHL pi+1     6             8, 10, ...        moved from cycle 5
.L2    AND bi         7             9, 11, ...        moved from cycle 6
.S1    SHR pi_s       7             9, 11, ...
.L1    ADD ci         8             10, ...
.S2    SHR bi+1       8             10, ...           moved from cycle 6
.D1    STH ci         9             11, ...           added
.S2    SHR pi+1_s     9             11, ...           moved from cycle 7
.L2    ADD ci+1       10            12, ...           added
.D2    STH ci+1       11            13, ...           added
1X     MPY pi         5             7, 9, 11, ...
2X     MPYHL pi+1     6             8, 10, ...        moved from cycle 5


Note: Each instruction repeats every two cycles, once per loop iteration. The
last column marks the changes in scheduling from Table 6-14.


6.5.7 Using the Assembly Optimizer for the Weighted Vector Sum


Example 6-35 shows the linear assembly code to perform the weighted vector 
sum. You can use this code as input to the assembly optimizer to create a soft- 
ware-pipelined loop instead of scheduling this by hand. 
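
Before reading the linear assembly, it can help to see the same data flow in C. The sketch below is a model for illustration, not TI code: the function names w_vec_ref and w_vec_packed and the helpers low16 and pack are hypothetical. It assumes little-endian packing (element i in the low halfword of each loaded word) and a compiler whose >> on negative values is an arithmetic shift.

```c
#include <stdint.h>

/* Wrap the low 16 bits of v into a signed 16-bit value (models STH). */
static int16_t low16(uint32_t v)
{
    v &= 0xFFFFu;
    return (int16_t)(v >= 0x8000u ? (int32_t)v - 0x10000 : (int32_t)v);
}

/* Pack two halfwords the way a little-endian LDW would see them. */
static uint32_t pack(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* One element per pass: c[i] = ((m * a[i]) >> 15) + b[i]. */
void w_vec_ref(const int16_t a[], const int16_t b[], int16_t c[],
               int16_t m, int n)
{
    for (int i = 0; i < n; i++) {
        int32_t p = ((int32_t)m * a[i]) >> 15;   /* assumes arithmetic >> */
        c[i] = low16((uint32_t)p + (uint32_t)(int32_t)b[i]);
    }
}

/* Two elements per pass, mirroring the linear assembly step by step. */
void w_vec_packed(const int16_t a[], const int16_t b[], int16_t c[],
                  int16_t m, int n)
{
    const uint32_t mask = 0x0000FFFFu;                  /* MVK -1, MVKH 0 */
    for (int i = 0; i + 1 < n; i += 2) {
        uint32_t ai_i1 = pack(a[i], a[i + 1]);          /* LDW */
        uint32_t bi_i1 = pack(b[i], b[i + 1]);          /* LDW */
        int32_t pi  = (int32_t)m * low16(ai_i1);        /* MPY: low half */
        int32_t pi1 = (int32_t)m * low16(ai_i1 >> 16);  /* MPYHL: high half */
        uint32_t bi  = bi_i1 & mask;                    /* AND */
        uint32_t bi1 = bi_i1 >> 16;                     /* SHR by 16 */
        /* Only the low 16 bits survive the store, so the zero-extended
         * bi and bi1 give the same stored result as sign-extended ones. */
        c[i]     = low16((uint32_t)(pi  >> 15) + bi);   /* SHR, ADD, STH */
        c[i + 1] = low16((uint32_t)(pi1 >> 15) + bi1);  /* SHR, ADD, STH */
    }
}
```

The packed version does one pair of word loads for every two results, which is why the loop counter in the linear assembly is 100/2.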


Example 6-35. Linear Assembly for Weighted Vector Sum


        .global _w_vec
_w_vec: .cproc  a, b, c, m
        .reg    ai_i1, bi_i1, pi, pi1, pi_i1, pi_s, pi1_s
        .reg    mask, bi, bi1, ci, ci1, c1, cntr
        MVK     -1,mask               ; set to all 1s to create 0xFFFFFFFF
        MVKH    0,mask                ; clear upper 16 bits to create 0xFFFF
        MVK     50,cntr               ; cntr = 100/2
        ADD     2,c,c1                ; point to c[1]
LOOP:   .trip 50
        LDW     .D1   *a++,ai_i1      ; ai & ai+1
        LDW     .D2   *b++,bi_i1      ; bi & bi+1
        MPY     .M1X  ai_i1,m,pi      ; m * ai
        MPYHL   .M2X  ai_i1,m,pi1     ; m * ai+1
        SHR     .S1   pi,15,pi_s      ; (m * ai) >> 15
        SHR     .S2   pi1,15,pi1_s    ; (m * ai+1) >> 15
        AND     .L2   bi_i1,mask,bi   ; bi
        SHR     .S2   bi_i1,16,bi1    ; bi+1
        ADD     .L1X  pi_s,bi,ci      ; ci = (m * ai) >> 15 + bi
        ADD     .L2   pi1_s,bi1,ci1   ; ci+1 = (m * ai+1) >> 15 + bi+1
        STH     .D1   ci,*c++[2]      ; store ci
        STH     .D2   ci1,*c1++[2]    ; store ci+1
[cntr]  SUB     .L1   cntr,1,cntr     ; decrement loop counter
[cntr]  B       .S1   LOOP            ; branch to loop
        .endproc


6.5.8 Final Assembly


Example 6-36 shows the final assembly code for the weighted vector sum. 
The following optimizations are included: 


- While iteration n of instruction STH ci+1 is executing, iteration n + 1 of
  STH ci is executing. To prevent the STH ci instruction from executing
  iteration 51 while STH ci+1 executes iteration 50, execute the loop only 49
  times and schedule the final executions of ADD ci+1 and STH ci+1 after
  exiting the loop.


- The mask for the AND instruction is created with MVK and MVKH in parallel
  with the loop prolog.


- The pointer to the odd elements in array c is also set up in parallel with the
  loop prolog.


Example 6-36. Assembly Code for Weighted Vector Sum


        LDW     .D1   *A4++,A2        ; ai & ai+1

        ADD     .L2X  A6,2,B0         ; set pointer to ci+1

        LDW     .D2   *B4++,B2        ; bi & bi+1
||      LDW     .D1   *A4++,A2        ;* ai & ai+1

        MVK     .S2   -1,B10          ; set to all 1s (0xFFFFFFFF)

        LDW     .D2   *B4++,B2        ;* bi & bi+1
||      LDW     .D1   *A4++,A2        ;** ai & ai+1
||      MVK     .S1   49,A1           ; set up loop counter
||      MVKH    .S2   0,B10           ; clr upper 16 bits (0x0000FFFF)

        MPY     .M1X  A2,B6,A5        ; m * ai
|| [A1] SUB     .L1   A1,1,A1         ; decrement loop counter

        MPYHL   .M2X  A2,B6,B5        ; m * ai+1
|| [A1] B       .S1   LOOP            ; branch to loop
||      LDW     .D2   *B4++,B2        ;** bi & bi+1
||      LDW     .D1   *A4++,A2        ;*** ai & ai+1

        SHR     .S1   A5,15,A7        ; (m * ai) >> 15
||      AND     .L2   B2,B10,B8       ; bi
||      MPY     .M1X  A2,B6,A5        ;* m * ai
|| [A1] SUB     .L1   A1,1,A1         ;* decrement loop counter

        SHR     .S2   B2,16,B1        ; bi+1
||      ADD     .L1X  A7,B8,A9        ; ci = (m * ai) >> 15 + bi
||      MPYHL   .M2X  A2,B6,B5        ;* m * ai+1
|| [A1] B       .S1   LOOP            ;* branch to loop
||      LDW     .D2   *B4++,B2        ;*** bi & bi+1
||      LDW     .D1   *A4++,A2        ;**** ai & ai+1

        SHR     .S2   B5,15,B7        ; (m * ai+1) >> 15
||      STH     .D1   A9,*A6++[2]     ; store ci
||      SHR     .S1   A5,15,A7        ;* (m * ai) >> 15
||      AND     .L2   B2,B10,B8       ;* bi
|| [A1] SUB     .L1   A1,1,A1         ;** decrement loop counter
||      MPY     .M1X  A2,B6,A5        ;** m * ai

LOOP:
        ADD     .L2   B7,B1,B9        ; ci+1 = (m * ai+1) >> 15 + bi+1
||      SHR     .S2   B2,16,B1        ;* bi+1
||      ADD     .L1X  A7,B8,A9        ;* ci = (m * ai) >> 15 + bi
||      MPYHL   .M2X  A2,B6,B5        ;** m * ai+1
|| [A1] B       .S1   LOOP            ;** branch to loop
||      LDW     .D2   *B4++,B2        ;**** bi & bi+1
||      LDW     .D1   *A4++,A2        ;***** ai & ai+1

        STH     .D2   B9,*B0++[2]     ; store ci+1
||      SHR     .S2   B5,15,B7        ;* (m * ai+1) >> 15
||      STH     .D1   A9,*A6++[2]     ;* store ci
||      SHR     .S1   A5,15,A7        ;** (m * ai) >> 15
||      AND     .L2   B2,B10,B8       ;** bi
|| [A1] SUB     .L1   A1,1,A1         ;*** decrement loop counter
||      MPY     .M1X  A2,B6,A5        ;*** m * ai
        ; Branch occurs here

        ADD     .L2   B7,B1,B9        ; ci+1 = (m * ai+1) >> 15 + bi+1

        STH     .D2   B9,*B0          ; store ci+1


6.6 Loop Carry Paths


Loop carry paths occur when one iteration of a loop writes a value that must 
be read by a future iteration. A loop carry path can affect the performance of 
a software-pipelined loop that executes multiple iterations in parallel. Some- 
times loop carry paths (instead of resources) determine the minimum iteration 
interval. 


IIR filter code contains a loop carry path; output samples are used as input to 
the computation of the next output sample. 
6.6.1 IIR Filter C Code 


Example 6-37 shows C code for a simple IIR filter. In this example, y[i] is an 
input to the calculation of y[i+1]. Before y[i] can be read for the next iteration, 
y[i+1] must be computed from the previous iteration. 


Example 6-37. IIR Filter C Code 


void iir(short x[], short y[], short c1, short c2, short c3)
{
    int i;

    for (i = 0; i < 100; i++) {
        y[i+1] = (c1*x[i] + c2*x[i+1] + c3*y[i]) >> 15;
    }
}


6.6.2 Translating C Code to Linear Assembly (Inner Loop)


Example 6-38 shows the ’C6x instructions that execute the inner loop of the 
IIR filter C code. In this example: 


- xptr is not postincremented after loading xi+1, because xi of the next
  iteration is actually xi+1 of the current iteration. Thus, the pointer points to
  the same address when loading both xi+1 for one iteration and xi for the
  next iteration.


- yptr is also not postincremented after storing yi+1, because yi of the next
  iteration is yi+1 for the current iteration.


Example 6-38. Linear Assembly for IIR Inner Loop


        LDH     *xptr++,xi       ; xi
        MPY     c1,xi,p0         ; c1 * xi
        LDH     *xptr,xi+1       ; xi+1
        MPY     c2,xi+1,p1       ; c2 * xi+1
        ADD     p0,p1,s0         ; c1 * xi + c2 * xi+1
        LDH     *yptr++,yi       ; yi
        MPY     c3,yi,p2         ; c3 * yi
        ADD     s0,p2,s1         ; c1 * xi + c2 * xi+1 + c3 * yi
        SHR     s1,15,yi+1       ; yi+1
        STH     yi+1,*yptr       ; store yi+1
[cntr]  SUB     cntr,1,cntr      ; decrement loop counter
[cntr]  B       LOOP             ; branch to loop


6.6.3 Drawing a Dependency Graph


Figure 6-15 shows the dependency graph for the IIR filter. A loop carry path 
exists from the store of yi+1 to the load of yi. The path between the STH and 
the LDH is one cycle because the load and store instructions use the same 
memory pipeline. Therefore, if a store is issued to a particular address on cycle
n and a load from that same address is issued on the next cycle, the load reads
the value that was written by the store instruction.


Figure 6-15. Dependency Graph of IIR Filter

Note: The shaded numbers show the loop carry path: 5 + 2 + 1 + 1 + 1 = 10.


6.6.4 Determining the Minimum Iteration Interval


To determine the minimum iteration interval, you must consider both resources 
and data dependency constraints. Based on resources in Table 6-16, the 
minimum iteration interval is 2. 


Note:

There are six non-.M units available: three on the A side (.S1, .D1, .L1) and
three on the B side (.S2, .D2, .L2). Therefore, to determine resource
constraints, divide the total number of non-.M units used on each side by 3
(3 is the total number of non-.M units available on each side).

Based on non-.M unit resources in Table 6-16, the minimum iteration interval
for the IIR filter is 2 because the total non-.M units on the A side is 5
(5 ÷ 3 is greater than 1, so you round up to the next whole number). The B side
uses only three non-.M units, so this does not affect the minimum iteration
interval, and no other unit is used more than twice.


Table 6-16. Resource Table for IIR Filter


(a) A side

Unit(s)              Instructions   Total/Unit
.M1                  2 MPYs         2
.S1                  B              1
.D1                  2 LDHs         2
.L1, .S1, or .D1     ADD & SUB      2
Total non-.M units                  5


(b) B side

Unit(s)              Instructions   Total/Unit
.M2                  MPY            1
.S2                  SHR            1
.D2                  STH            1
.L2, .S2, or .D2     ADD            1
Total non-.M units                  3

However, the IIR has a data dependency constraint defined by its loop carry 
path. Figure 6-15 shows that if you schedule LDH yi on cycle 0: 


- The earliest you can schedule MPY p2 is on cycle 5.

- The earliest you can schedule ADD s1 is on cycle 7.

- SHR yi+1 must be on cycle 8 and STH on cycle 9.


Because the LDH must wait for the STH to be issued, the earliest the
second iteration can begin is cycle 10.


To determine the minimum loop carry path, add all of the numbers along the
loop paths in the dependency graph. This means that this loop carry path is
10 (5 + 2 + 1 + 1 + 1).


Although the minimum iteration interval is the greater of the resource limits and
data dependency constraints, an interval of 10 seems slow. Figure 6-16 
shows how to improve the performance. 
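
The two constraints combine as a simple maximum. The following C sketch illustrates the arithmetic only; resource_bound and min_iteration_interval are hypothetical names, and the sketch assumes three non-.M units per side, as stated in the note above.

```c
/* ceil(a / b) for positive integers. */
static int ceil_div(int a, int b)
{
    return (a + b - 1) / b;
}

/* Resource bound for one side: the non-.M instructions share three units,
 * and no single unit can be used more often than the interval allows. */
int resource_bound(int non_m_instructions, int busiest_unit_uses)
{
    int shared = ceil_div(non_m_instructions, 3);
    return shared > busiest_unit_uses ? shared : busiest_unit_uses;
}

/* The minimum iteration interval is the greater of the resource limit and
 * the length of the loop carry path. */
int min_iteration_interval(int resource_limit, int loop_carry_path)
{
    return resource_limit > loop_carry_path ? resource_limit : loop_carry_path;
}
```

For the A side of the IIR filter, resource_bound(5, 2) gives 2; combined with the loop carry path of 10, min_iteration_interval(2, 10) gives 10.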


6.6.4.1 Drawing a New Dependency Graph 


Figure 6-16 shows a new graph with a loop carry path of 4 (2 + 1 + 1). Because
the MPY p2 instruction can read yi+1 while it is still in a register, you can reduce
the loop carry path by six cycles. LDH yi is no longer in the graph. Instead, you
can issue LDH y[0] once outside the loop. In every iteration after that, the yi+1
values written by the SHR instruction are valid y inputs to the MPY instruction.
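
The same change can be pictured at the C level. The sketch below is illustrative and not from the tools: iir_reload and iir_carried are hypothetical names, a trip-count parameter n is added for testing, and the sketch assumes the intermediate sums fit in an int and that >> on negative values is an arithmetic shift.

```c
/* Direct form: y[i] is read back from memory on every pass, which
 * corresponds to the LDH yi in the longer loop carry path. */
void iir_reload(const short x[], short y[], short c1, short c2, short c3,
                int n)
{
    for (int i = 0; i < n; i++)
        y[i + 1] = (short)((c1 * x[i] + c2 * x[i + 1] + c3 * y[i]) >> 15);
}

/* Carried form: y[0] is loaded once before the loop; after that, the value
 * just computed stays in a local variable (a register) for the next pass. */
void iir_carried(const short x[], short y[], short c1, short c2, short c3,
                 int n)
{
    short yi = y[0];                       /* single load outside the loop */
    for (int i = 0; i < n; i++) {
        yi = (short)((c1 * x[i] + c2 * x[i + 1] + c3 * yi) >> 15);
        y[i + 1] = yi;                     /* store only; no reload of y */
    }
}
```

Both functions write the same y values; the second one simply never loads from y inside the loop.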


Figure 6-16. Dependency Graph of IIR Filter (With Smaller Loop Carry)

Note: The shaded numbers show the loop carry path: 2 + 1 + 1 = 4.


6.6.4.2 New ’C6x Instructions (Inner Loop)


Example 6-39 shows the new linear assembly from the graph in Figure 6-16,
where LDH yi was removed. The one variable y that is read and written is yi
for the MPY p2 instruction and yi+1 for the SHR and STH instructions.


Example 6-39. Linear Assembly for IIR Inner Loop With Reduced Loop Carry Path


        LDH     *xptr++,xi       ; xi
        MPY     c1,xi,p0         ; c1 * xi
        LDH     *xptr,xi+1       ; xi+1
        MPY     c2,xi+1,p1       ; c2 * xi+1
        ADD     p0,p1,s0         ; c1 * xi + c2 * xi+1
        MPY     c3,y,p2          ; c3 * yi
        ADD     s0,p2,s1         ; c1 * xi + c2 * xi+1 + c3 * yi
        SHR     s1,15,y          ; yi+1
        STH     y,*yptr++        ; store yi+1
[cntr]  SUB     cntr,1,cntr      ; decrement loop counter
[cntr]  B       LOOP             ; branch to loop


6.6.5 Linear Assembly Resource Allocation 


Example 6-40 shows the same linear assembly instructions as those in 
Example 6-39 with the functional units and registers assigned. 


Example 6-40. Linear Assembly for IIR Inner Loop (With Allocated Resources)


        LDH     .D1   *A4++,A2        ; xi
        MPY     .M1   A6,A2,A5        ; c1 * xi
        LDH     .D1   *A4,A3          ; xi+1
        MPY     .M1X  B6,A3,A7        ; c2 * xi+1
        ADD     .L1   A5,A7,A9        ; c1 * xi + c2 * xi+1
        MPY     .M2X  A8,B2,B3        ; c3 * yi
        ADD     .L2X  B3,A9,B5        ; c1 * xi + c2 * xi+1 + c3 * yi
        SHR     .S2   B5,15,B2        ; yi+1
        STH     .D2   B2,*B4++        ; store yi+1
[A1]    SUB     .L1   A1,1,A1         ; decrement loop counter
[A1]    B       .S1   LOOP            ; branch to loop


6.6.6 Modulo Iteration Interval Scheduling


Table 6-17 shows the modulo iteration interval table for the IIR filter. The SHR
instruction on cycle 10 finishes in time for the MPY p2 instruction from the next
iteration to read its result on cycle 11.


Table 6-17. Modulo Iteration Interval Table for IIR (4-Cycle Loop)

Unit   Cross path   Instruction   First cycle   Also on cycles
.D1                 LDH xi        0             4, 8, 12, ...
.D1                 LDH xi+1      1             5, 9, 13, ...
.M1                 MPY p0        5             9, 13, 17, ...
.L1                 SUB cntr      5             9, 13, 17, ...
.M1    1X           MPY p1        6             10, 14, 18, ...
.S1                 B LOOP        6             10, 14, 18, ...
.M2    2X           MPY p2        7             11, 15, 19, ...
.L1                 ADD s0        8             12, 16, ...
.L2    2X           ADD s1        9             13, 17, ...
.S2                 SHR yi+1      10            14, 18, ...
.D2                 STH yi+1      11            15, 19, ...


Note: Each instruction repeats every four cycles, once per loop iteration.


6.6.7 Using the Assembly Optimizer for the IIR Filter 


Example 6—41 shows the linear assembly code to perform the IIR filter. Once 
again, you can use this code as input to the assembly optimizer to create a soft- 
ware-pipelined loop instead of scheduling this by hand. 


Example 6-41. Linear Assembly for IIR Filter


        .global _iir
_iir:   .cproc  x, y, c1, c2, c3
        .reg    xi, xi1, yi1
        .reg    p0, p1, p2, s0, s1, cntr
        MVK     100,cntr              ; cntr = 100
        LDH     .D2   *y++,yi1        ; yi+1
LOOP:   .trip 100
        LDH     .D1   *x++,xi         ; xi
        MPY     .M1   c1,xi,p0        ; c1 * xi
        LDH     .D1   *x,xi1          ; xi+1
        MPY     .M1X  c2,xi1,p1       ; c2 * xi+1
        ADD     .L1   p0,p1,s0        ; c1 * xi + c2 * xi+1
        MPY     .M2X  c3,yi1,p2       ; c3 * yi
        ADD     .L2X  s0,p2,s1        ; c1 * xi + c2 * xi+1 + c3 * yi
        SHR     .S2   s1,15,yi1       ; yi+1
        STH     .D2   yi1,*y++        ; store yi+1
[cntr]  SUB     .L1   cntr,1,cntr     ; decrement loop counter
[cntr]  B       .S1   LOOP            ; branch to loop
        .endproc


6.6.8 Final Assembly


Example 6-42 shows the final assembly for the IIR filter. With one load of y[0]
outside the loop, no other loads from the y array are needed. Example 6-42
requires 408 cycles: (4 × 100) + 8.
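
The cycle count follows a simple pattern: the iteration interval times the trip count, plus the cycles spent outside the loop kernel. A minimal C sketch of that arithmetic is shown below; pipelined_cycles is a hypothetical name, and treating the overhead as one fixed number is a simplifying assumption.

```c
/* Hypothetical cost model: kernel cycles plus the fixed cycles outside the
 * kernel (8 for the IIR filter assembly). */
int pipelined_cycles(int ii, int trip_count, int overhead)
{
    return ii * trip_count + overhead;
}
```

For the IIR filter, pipelined_cycles(4, 100, 8) returns 408.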


Example 6-42. Assembly Code for IIR Filter


        LDH     .D1   *A4++,A2        ; xi

        LDH     .D1   *A4,A3          ; xi+1

        LDH     .D2   *B4++,B2        ; load y[0] outside of loop

        MVK     .S1   100,A1          ; set up loop counter

        LDH     .D1   *A4++,A2        ;* xi

   [A1] SUB     .L1   A1,1,A1         ; decrement loop counter
||      MPY     .M1   A6,A2,A5        ; c1 * xi
||      LDH     .D1   *A4,A3          ;* xi+1

        MPY     .M1X  B6,A3,A7        ; c2 * xi+1
|| [A1] B       .S1   LOOP            ; branch to loop

        MPY     .M2X  A8,B2,B3        ; c3 * yi

LOOP:
        ADD     .L1   A5,A7,A9        ; c1 * xi + c2 * xi+1
||      LDH     .D1   *A4++,A2        ;** xi

        ADD     .L2X  B3,A9,B5        ; c1 * xi + c2 * xi+1 + c3 * yi
|| [A1] SUB     .L1   A1,1,A1         ;* decrement loop counter
||      MPY     .M1   A6,A2,A5        ;* c1 * xi
||      LDH     .D1   *A4,A3          ;** xi+1

        SHR     .S2   B5,15,B2        ; yi+1
||      MPY     .M1X  B6,A3,A7        ;* c2 * xi+1
|| [A1] B       .S1   LOOP            ;* branch to loop

        STH     .D2   B2,*B4++        ; store yi+1
||      MPY     .M2X  A8,B2,B3        ;* c3 * yi
        ; Branch occurs here


6.7 If-Then-Else Statements in a Loop


If-then-else statements in C cause certain instructions to execute when the if
condition is true and other instructions to execute when it is false. One way to
accomplish this in linear assembly code is with conditional instructions.
Because all ’C6x instructions can be conditional on one of five general-purpose
registers, conditional instructions can handle both the true and false cases of
the if-then-else C statement.


6.7.1 If-Then-Else C Code


Example 6—43 contains a loop with an if-then-else statement. You either add 
a[i] to sum or subtract a[i] from sum. 


Example 6-43. If-Then-Else C Code 


int if_then(short a[], int codeword, int mask, short theta)
{
    int i, sum, cond;

    sum = 0;
    for (i = 0; i < 32; i++) {
        cond = codeword & mask;
        if (theta == !(!(cond)))
            sum += a[i];
        else
            sum -= a[i];
        mask = mask << 1;
    }
    return (sum);
}

Branching is one way to execute the if-then-else statement: branch to the ADD 
when the if statement is true and branch to the SUB when the if statement is 
false. However, because each branch has five delay slots, this method 
requires additional cycles. Furthermore, branching within the loop makes soft- 
ware pipelining almost impossible. 


Using conditional instructions, on the other hand, eliminates the need to 
branch to the appropriate piece of code after checking whether the condition 
is true or false. Simply program both the ADD and SUB as usual, but make 
them conditional on the zero and nonzero values of a condition register. This 
method also allows you to software pipeline the loop and achieve much better 
performance than you would with branching. 
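
The difference between the two approaches can be modeled in C. The sketch below is an illustration under stated assumptions, not compiler output: if_then_branch and if_then_select are hypothetical names, and codeword and mask are declared unsigned here so that shifting the mask through bit 31 is well defined.

```c
/* Branching form, following the if-then-else C code. */
int if_then_branch(const short a[], unsigned int codeword,
                   unsigned int mask, short theta)
{
    int sum = 0;
    for (int i = 0; i < 32; i++) {
        int cond = (codeword & mask) != 0;    /* same as !(!(cond)) */
        if (theta == cond)
            sum += a[i];
        else
            sum -= a[i];
        mask <<= 1;
    }
    return sum;
}

/* Select form: both results are computed and a flag picks one, which is
 * how a conditional ADD and a conditional SUB on opposite senses of the
 * same register behave. */
int if_then_select(const short a[], unsigned int codeword,
                   unsigned int mask, short theta)
{
    int sum = 0;
    for (int i = 0; i < 32; i++) {
        int cond = (codeword & mask) != 0;    /* AND, then conditional MVK */
        int if_flag = (theta == cond);        /* CMPEQ */
        int added = sum + a[i];               /* result kept when flag != 0 */
        int subbed = sum - a[i];              /* result kept when flag == 0 */
        sum = if_flag ? added : subbed;       /* exactly one result is kept */
        mask <<= 1;
    }
    return sum;
}
```

Both functions return the same sum for the same inputs; the second has a loop body with no data-dependent control flow, which is the property that makes software pipelining practical.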
6.7.2 Translating C Code to Linear Assembly


Example 6-44 shows the linear assembly instructions needed to execute the
inner loop of the C code in Example 6-43.


Example 6-44. Linear Assembly for If-Then-Else Inner Loop


        AND     codeword,mask,cond    ; cond = codeword & mask
[cond]  MVK     1,cond                ; !(!(cond))
        CMPEQ   theta,cond,if         ; (theta == !(!(cond)))
        LDH     *aptr++,ai            ; a[i]
[if]    ADD     sum,ai,sum            ; sum += a[i]
[!if]   SUB     sum,ai,sum            ; sum -= a[i]
        SHL     mask,1,mask           ; mask = mask << 1;
[cntr]  ADD     -1,cntr,cntr          ; decrement counter
[cntr]  B       LOOP                  ; for LOOP


CMPEQ is used to create IF. The ADD is conditional when IF is nonzero
(corresponds to then); the SUB is conditional when IF is 0 (corresponds to else).


A conditional MVK performs the !(!(cond)) C statement. If the result of the
bitwise AND is nonzero, a 1 is written into cond; if the result of the AND is 0,
cond remains at 0.


6.7.3 Drawing a Dependency Graph


Figure 6-17 shows the dependency graph for the if-then-else C code. This 
graph illustrates the following arrangement: 


- Two nodes on the graph contain sum: one for the ADD and one for the
  SUB. Because some iterations are performing an ADD and others are
  performing a SUB, each of these nodes is a possible input to the next
  iteration of either node.


- The LDH ai instruction is a parent of both ADD sum and SUB sum,
  because both instructions read ai.


- CMPEQ if is also a parent to ADD sum and SUB sum, because both read
  IF for the conditional execution.


- The result of SHL mask is read on the next iteration by the AND cond
  instruction.


Figure 6-17. Dependency Graph of If-Then-Else Code


6.7.4 Determining the Minimum Iteration Interval


With nine instructions, the minimum iteration interval is at least 2, because a 
maximum of eight instructions can be in parallel. Based on the way the depen- 
dency graph in Figure 6—17 is split, five instructions are on the A side and four 
are on the B side. Because none of the instructions are MPYs, all instructions 
must go on the .S, .D, or .L units, which means you have a total of six 
resources. 


- LDH must be on a .D unit.
- SHL, B, and MVK must be on a .S unit.
- The ADDs and SUB can be on the .S, .L, or .D units.
- The AND can be on a .S or .L unit.


From Table 6-18, you can see that no one resource is used more than two 
times, so the minimum iteration interval is still 2. 


Table 6-18. Resource Table for If-Then-Else Code


(a) A side

Unit(s)              Instructions   Total/Unit
.M1                                 0
.S1                  SHL & B        2
.D1                  LDH            1
.L1, .S1, or .D1     ADD & SUB      2
Total non-.M units                  5


(b) B side

Unit(s)              Instructions   Total/Unit
.M2                                 0
.S2                  MVK            1
.L2                  CMPEQ          1
.L2 or .S2           AND            1
.L2, .S2, or .D2     ADD            1
Total non-.M units                  4


The minimum iteration interval is also affected by the total number of
instructions. Because three units can perform nonmultiply operations on a given
side, up to six instructions (3 units × 2 cycles) can be performed on each side
with a minimum iteration interval of 2. Because five instructions are on the A
side and only four are on the B side, the minimum iteration interval is still 2.


6.7.5 Linear Assembly Resource Allocation


Now that the graph is split and you know the minimum iteration interval, you 
can allocate functional units and registers to the instructions. You must ensure 
that no resource is used more than twice. 


Example 6—45 shows the linear assembly with the functional units and regis- 
ters that are used in the inner loop. 


Example 6-45. Linear Assembly for Full If-Then-Else Code


          .global _if_then
_if_then: .cproc  a, cword, mask, theta
          .reg    cond, if, ai, sum, cntr
          MVK     32,cntr                  ; cntr = 32
          ZERO    sum                      ; sum = 0
LOOP:     .trip 32
          AND     .S2X  cword,mask,cond    ; cond = codeword & mask
[cond]    MVK     .S2   1,cond             ; !(!(cond))
          CMPEQ   .L2   theta,cond,if      ; (theta == !(!(cond)))
          LDH     .D1   *a++,ai            ; a[i]
[if]      ADD     .L1   sum,ai,sum         ; sum += a[i]
[!if]     SUB     .D1   sum,ai,sum         ; sum -= a[i]
          SHL     .S1   mask,1,mask        ; mask = mask << 1;
[cntr]    ADD     .L2   -1,cntr,cntr       ; decrement counter
[cntr]    B       .S1   LOOP               ; for LOOP
          .return sum
          .endproc


6.7.6 Final Assembly


Example 6-46 shows the final assembly code after software pipelining. The
performance of this loop is 70 cycles (2 × 32 + 6).


Example 6-46. Assembly Code for If-Then-Else


         MVK     .S2   32,B0          ; set up loop counter

   [B0]  ADD     .L2   -1,B0,B0       ; decrement counter

   [B0]  ADD     .L2   -1,B0,B0       ; decrement counter
|| [B0]  B       .S1   LOOP           ; for LOOP
||       LDH     .D1   *A4++,A5       ; a[i]

         SHL     .S1   A6,1,A6        ; mask = mask << 1;
||       AND     .S2X  B4,A6,B2       ; cond = codeword & mask

   [B2]  MVK     .S2   1,B2           ; !(!(cond))
|| [B0]  ADD     .L2   -1,B0,B0       ; decrement counter
|| [B0]  B       .S1   LOOP           ;* for LOOP
||       LDH     .D1   *A4++,A5       ;* a[i]

         CMPEQ   .L2   B6,B2,B1       ; (theta == !(!(cond)))
||       SHL     .S1   A6,1,A6        ;* mask = mask << 1;
||       AND     .S2X  B4,A6,B2       ;* cond = codeword & mask
||       ZERO    .L1   A7             ; zero out accumulator

LOOP:
   [B0]  ADD     .L2   -1,B0,B0       ; decrement counter
|| [B2]  MVK     .S2   1,B2           ;* !(!(cond))
|| [B0]  B       .S1   LOOP           ;** for LOOP
||       LDH     .D1   *A4++,A5       ;** a[i]

   [B1]  ADD     .L1   A7,A5,A7       ; sum += a[i]
|| [!B1] SUB     .D1   A7,A5,A7       ; sum -= a[i]
||       CMPEQ   .L2   B6,B2,B1       ;* (theta == !(!(cond)))
||       SHL     .S1   A6,1,A6        ;** mask = mask << 1;
||       AND     .S2X  B4,A6,B2       ;** cond = codeword & mask
         ; Branch occurs here


6.7.7 Comparing Performance


You can improve the performance of the code in Example 6-46 if you know 
that the loop count is at least 3. If the loop count is at least 3, remove the
decrement counter instructions outside the loop and put the MVK (for setting up the
loop counter) in parallel with the first branch. These two changes save two 
cycles at the beginning of the loop prolog. 


The first two branches are now unconditional, because the loop count is at 
least 3 and you know that the first two branches must execute. To account for 
the removal of the three decrement-loop-counter instructions, set the loop 
counter to 3 fewer than the actual number of times you want the loop to 
execute: in this case, 29 (32 — 3). 


Example 6-47. Assembly Code for If-Then-Else With Loop Count Greater Than 3


         B       .S1   LOOP           ; for LOOP
||       LDH     .D1   *A4++,A5       ; a[i]
||       MVK     .S2   29,B0          ; set up loop counter

         SHL     .S1   A6,1,A6        ; mask = mask << 1;
||       AND     .S2X  B4,A6,B2       ; cond = codeword & mask

   [B2]  MVK     .S2   1,B2           ; !(!(cond))
||       B       .S1   LOOP           ;* for LOOP
||       LDH     .D1   *A4++,A5       ;* a[i]

         CMPEQ   .L2   B6,B2,B1       ; (theta == !(!(cond)))
||       SHL     .S1   A6,1,A6        ;* mask = mask << 1;
||       AND     .S2X  B4,A6,B2       ;* cond = codeword & mask
||       ZERO    .L1   A7             ; zero out accumulator

LOOP:
   [B0]  ADD     .L2   -1,B0,B0       ; decrement counter
|| [B2]  MVK     .S2   1,B2           ;* !(!(cond))
|| [B0]  B       .S1   LOOP           ;** for LOOP
||       LDH     .D1   *A4++,A5       ;** a[i]

   [B1]  ADD     .L1   A7,A5,A7       ; sum += a[i]
|| [!B1] SUB     .D1   A7,A5,A7       ; sum -= a[i]
||       CMPEQ   .L2   B6,B2,B1       ;* (theta == !(!(cond)))
||       SHL     .S1   A6,1,A6        ;** mask = mask << 1;
||       AND     .S2X  B4,A6,B2       ;** cond = codeword & mask
         ; Branch occurs here


Example 6-47 shows the improved loop with a cycle count of 68 (2 × 32 + 4).
Table 6-19 compares the performance of Example 6-46 and Example 6-47.


Table 6-19. Comparison of If-Then-Else Code Examples

Code Example                                                              Cycles         Cycle Count
Example 6-46  If-then-else assembly code                                  (2 × 32) + 6   70
Example 6-47  If-then-else assembly code with loop count greater than 3   (2 × 32) + 4   68


6.8 Loop Unrolling


Even though the performance of the previous example is good, it can be im- 
proved. When resources are not fully used, you can improve performance by 
unrolling the loop. In Example 6-47, only nine instructions execute every two
cycles. If you unroll the loop and analyze the new minimum iteration interval, 
you have room to add instructions. A minimum iteration interval of 3 provides 
a 25% improvement in throughput: three cycles to do two iterations, rather 
than the four cycles required in Example 6—47. 
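
The 25% figure can be reproduced with integer arithmetic. The C sketch below is only a worked calculation; kernel_cycles and improvement_percent are hypothetical names, and the sketch assumes the element count is a multiple of the unroll factor.

```c
/* Kernel cycles needed to process a number of array elements when each
 * pass of the loop handles unroll elements in ii cycles. */
int kernel_cycles(int ii, int elements, int unroll)
{
    return ii * (elements / unroll);
}

/* Percent reduction in cycles per element when moving from one
 * (ii, unroll) pair to another. */
int improvement_percent(int ii_old, int unroll_old, int ii_new, int unroll_new)
{
    int old_cost = ii_old * unroll_new;   /* cross-multiplied to avoid fractions */
    int new_cost = ii_new * unroll_old;
    return 100 * (old_cost - new_cost) / old_cost;
}
```

For 32 elements, kernel_cycles(2, 32, 1) is 64 and kernel_cycles(3, 32, 2) is 48, and improvement_percent(2, 1, 3, 2) is 25.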


6.8.1 Unrolled If-Then-Else C Code 


Example 6-48 shows the unrolled version of the if-then-else C code in
Example 6-43.


Example 6—48. If-Then-Else C Code (Unrolled) 


int unrolled_if_then(short a[], int codeword, int mask, short theta)
{
    int i, sum, cond;

    sum = 0;
    for (i = 0; i < 32; i += 2) {
        cond = codeword & mask;
        if (theta == !(!(cond)))
            sum += a[i];
        else
            sum -= a[i];
        mask = mask << 1;

        cond = codeword & mask;
        if (theta == !(!(cond)))
            sum += a[i+1];
        else
            sum -= a[i+1];
        mask = mask << 1;
    }
    return (sum);
}
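
The linear assembly for the unrolled loop keeps two partial sums (sumi and sumi1) and adds them once after the loop. The C sketch below shows that variant for illustration only: unrolled_two_sums is a hypothetical name, and codeword and mask are declared unsigned here so that shifting the mask through bit 31 is well defined.

```c
/* Unrolled by two with one accumulator per half, so the even and odd
 * elements no longer form a single chain through sum. */
int unrolled_two_sums(const short a[], unsigned int codeword,
                      unsigned int mask, short theta)
{
    int sumi = 0;    /* accumulates elements 0, 2, 4, ... */
    int sumi1 = 0;   /* accumulates elements 1, 3, 5, ... */

    for (int i = 0; i < 32; i += 2) {
        int cond = (codeword & mask) != 0;
        if (theta == cond)
            sumi += a[i];
        else
            sumi -= a[i];
        mask <<= 1;

        cond = (codeword & mask) != 0;
        if (theta == cond)
            sumi1 += a[i + 1];
        else
            sumi1 -= a[i + 1];
        mask <<= 1;
    }
    return sumi + sumi1;   /* combine the partial sums once, after the loop */
}
```

Because integer addition is associative, the combined result matches the single-accumulator version for the same inputs.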




6.8.2 Translating C Code to Linear Assembly 


Example 6-49 shows the unrolled inner loop with 16 instructions and the 
possibility of achieving a loop with a minimum iteration interval of 3. 


Example 6-49. Linear Assembly for Unrolled If-Then-Else Inner Loop


          AND     codeword,maski,condi       ; condi = codeword & maski
[condi]   MVK     1,condi                    ; !(!(condi))
          CMPEQ   theta,condi,ifi            ; (theta == !(!(condi)))
          LDH     *aptr++,ai                 ; a[i]
[ifi]     ADD     sumi,ai,sumi               ; sum += a[i]
[!ifi]    SUB     sumi,ai,sumi               ; sum -= a[i]
          SHL     maski,1,maski+1            ; maski+1 = maski << 1;

          AND     codeword,maski+1,condi+1   ; condi+1 = codeword & maski+1
[condi+1] MVK     1,condi+1                  ; !(!(condi+1))
          CMPEQ   theta,condi+1,ifi+1        ; (theta == !(!(condi+1)))
          LDH     *aptr++,ai+1               ; a[i+1]
[ifi+1]   ADD     sumi+1,ai+1,sumi+1         ; sum += a[i+1]
[!ifi+1]  SUB     sumi+1,ai+1,sumi+1         ; sum -= a[i+1]
          SHL     maski+1,1,maski            ; maski = maski+1 << 1;

[cntr]    ADD     -1,cntr,cntr               ; decrement counter
[cntr]    B       LOOP                       ; for LOOP


6.8.3 Drawing a Dependency Graph


Although there are numerous ways to split the dependency graph, the main 
goal is to achieve a minimum iteration interval of 3 and meet these conditions: 


- You cannot have more than nine non-.M instructions on either side.
- Only three non-.M instructions can execute per cycle.


Figure 6-18 shows the dependency graph for the unrolled if-then-else code.
Nine instructions are on the A side, and seven instructions are on the B side.


Figure 6-18. Dependency Graph of If-Then-Else Code (Unrolled)


6.8.4 Determining the Minimum Iteration Interval


With 16 instructions, the minimum iteration interval is at least 3 because a 
maximum of six instructions can be in parallel with the following allocation 
possibilities: 


(_j LDH must be on a.D unit. 

[1 SHL, B, and MVK must be on a.S unit. 

[1 The ADDs and SUB can be ona..§, .L, or .D unit. 
[1 The AND can be ona.§S or .L unit. 


From Table 6—20, you can see that no one resource is used more than three 
times so that the minimum iteration interval is still 3. 


Checking the total number of non-.M instructions on each side shows that a
total of nine instructions can be performed on one side with the minimum
iteration interval of 3 (three non-.M units for three cycles). Because the A
side has exactly nine non-.M instructions and the B side has only seven, the
minimum iteration interval is still 3.
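The resource checks above amount to a few ceiling divisions. The following sketch shows the arithmetic; the function name `resource_min_ii` and its parameters are hypothetical, and it covers only the resource bound (data dependencies can raise the interval further).

```c
/* Integer ceiling of a / b for positive operands. */
static int ceil_div(int a, int b) { return (a + b - 1) / b; }

static int max2(int a, int b) { return a > b ? a : b; }

/*
 * Resource-based lower bound on the iteration interval.
 *   a_side_non_m, b_side_non_m : non-.M instructions assigned to each side
 *   max_uses_of_one_unit       : largest count in the Total/Unit column
 * Assumes three non-.M units (.L, .S, .D) per side.
 */
int resource_min_ii(int a_side_non_m, int b_side_non_m,
                    int max_uses_of_one_unit)
{
    const int non_m_units_per_side = 3;
    int total = a_side_non_m + b_side_non_m;

    int ii = ceil_div(total, 2 * non_m_units_per_side);
    ii = max2(ii, ceil_div(a_side_non_m, non_m_units_per_side));
    ii = max2(ii, ceil_div(b_side_non_m, non_m_units_per_side));
    ii = max2(ii, max_uses_of_one_unit);
    return ii;
}
```

For the unrolled if-then-else loop, `resource_min_ii(9, 7, 3)` gives 3, which agrees with the text.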


Table 6-20. Resource Table for Unrolled If-Then-Else Code


(a) A side

Unit(s)               Instructions      Total/Unit
.M1                                     0
.S1                   MVK and 2 SHLs    3
.D1                   2 LDHs            2
.L1                   CMPEQ             1
.L1 or .S1            AND               1
.L1, .S1, or .D1      ADD and SUB       2
Total non-.M units                      9


(b) B side

Unit(s)               Instructions      Total/Unit
.M2                                     0
.S2                   MVK and B         2
.L2                   CMPEQ             1
.L2 or .S2            AND               1
.L2, .S2, or .D2      SUB and 2 ADDs    3
Total non-.M units                      7


6.8.5 Linear Assembly Resource Allocation 
Now that the graph is split and you know the minimum iteration interval, you
can allocate functional units and registers to the instructions. You must ensure
no resource is used more than three times.


Example 6-50 shows the linear assembly code with the functional units and
registers.


Example 6-50. Linear Assembly for Full Unrolled If-Then-Else Code


        .global _unrolled_if_then
_unrolled_if_then: .cproc a, cword, mask, theta
        .reg    cword, mask, theta, ifi, ifi1, a, ai, ai1, cntr
        .reg    cdi, cdi1, sumi, sumi1, sum

        MV      A4,a                    ; C callable register for 1st operand
        MV      B4,cword                ; C callable register for 2nd operand
        MV      A6,mask                 ; C callable register for 3rd operand
        MV      B6,theta                ; C callable register for 4th operand
        MVK     16,cntr                 ; cntr = 32/2
        ZERO    sumi                    ; sumi = 0
        ZERO    sumi1                   ; sumi+1 = 0

LOOP:   .trip 32
        AND   .L1X  cword,mask,cdi      ; cdi = codeword & maski
[cdi]   MVK   .S1   1,cdi               ; !(!(cdi))
        CMPEQ .L1X  theta,cdi,ifi       ; (theta == !(!(cdi)))
        LDH   .D1   *a++,ai             ; a[i]
[ifi]   ADD   .L1   sumi,ai,sumi        ; sum += a[i]
[!ifi]  SUB   .D1   sumi,ai,sumi        ; sum -= a[i]
        SHL   .S1   mask,1,mask         ; maski+1 = maski << 1;

        AND   .L2X  cword,mask,cdi1     ; cdi+1 = codeword & maski+1
[cdi1]  MVK   .S2   1,cdi1              ; !(!(cdi+1))
        CMPEQ .L2   theta,cdi1,ifi1     ; (theta == !(!(cdi+1)))
        LDH   .D1   *a++,ai1            ; a[i+1]
[ifi1]  ADD   .L2   sumi1,ai1,sumi1     ; sum += a[i+1]
[!ifi1] SUB   .D2   sumi1,ai1,sumi1     ; sum -= a[i+1]
        SHL   .S1   mask,1,mask         ; maski = maski+1 << 1;

[cntr]  ADD   .D2   -1,cntr,cntr        ; decrement counter
[cntr]  B     .S2   LOOP                ; for LOOP

        ADD         sumi,sumi1,sum      ; Add sumi and sumi+1 for ret value

        .return sum

        .endproc
6.8.6 Final Assembly


Example 6-51 shows the final assembly code after software pipelining. The 
cycle count of this loop is now 53: (3 x 16) + 5. 


Example 6-51. Assembly Code for Unrolled If-Then-Else


         MVK   .S2   16,B0          ; set up loop counter

         LDH   .D1   *A4++,A5       ; a[i]
|| [B0]  ADD   .D2   -1,B0,B0       ; decrement counter

         LDH   .D1   *A4++,B5       ; a[i+1]
|| [B0]  B     .S2   LOOP           ; for LOOP
|| [B0]  ADD   .D2   -1,B0,B0       ; decrement counter
||       SHL   .S1   A6,1,A6        ; maski+1 = maski << 1;
||       AND   .L1X  B4,A6,A2       ; condi = codeword & maski

   [A2]  MVK   .S1   1,A2           ; !(!(condi))
||       AND   .L2X  B4,A6,B2       ; condi+1 = codeword & maski+1
||       ZERO  .L1   A7             ; zero accumulator

   [B2]  MVK   .S2   1,B2           ; !(!(condi+1))
||       CMPEQ .L1X  B6,A2,A1       ; (theta == !(!(condi)))
||       SHL   .S1   A6,1,A6        ; maski = maski+1 << 1;
||       LDH   .D1   *A4++,A5       ;* a[i]
||       ZERO  .L2   B7             ; zero accumulator

LOOP:
         CMPEQ .L2   B6,B2,B1       ; (theta == !(!(condi+1)))
|| [B0]  ADD   .D2   -1,B0,B0       ; decrement counter
||       LDH   .D1   *A4++,B5       ;* a[i+1]
|| [B0]  B     .S2   LOOP           ;* for LOOP
||       SHL   .S1   A6,1,A6        ;* maski+1 = maski << 1;
||       AND   .L1X  B4,A6,A2       ;* condi = codeword & maski

   [A1]  ADD   .L1   A7,A5,A7       ; sum += a[i]
|| [!A1] SUB   .D1   A7,A5,A7       ; sum -= a[i]
|| [A2]  MVK   .S1   1,A2           ;* !(!(condi))
||       AND   .L2X  B4,A6,B2       ;* condi+1 = codeword & maski+1

   [B1]  ADD   .L2   B7,B5,B7       ; sum += a[i+1]
|| [!B1] SUB   .D2   B7,B5,B7       ; sum -= a[i+1]
|| [B2]  MVK   .S2   1,B2           ;* !(!(condi+1))
||       CMPEQ .L1X  B6,A2,A1       ;* (theta == !(!(condi)))
||       SHL   .S1   A6,1,A6        ;* maski = maski+1 << 1;
||       LDH   .D1   *A4++,A5       ;** a[i]
         ; Branch occurs here

         ADD   .L1X  A7,B7,A4       ; move to return register
6.8.7 Comparing Performance


Table 6-21 compares the performance of all versions of the if-then-else code 
examples. 


Table 6-21. Comparison of If-Then-Else Code Examples 


Code Example                                                 Cycles         Cycle Count
Example 6-46  If-then-else assembly code                     (2 x 32) + 6   70
Example 6-47  If-then-else assembly code with loop count
              greater than 3                                 (2 x 32) + 4   68
Example 6-51  Unrolled if-then-else assembly code            (3 x 16) + 5   53
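The Cycles column in Table 6-21 follows one pattern: iteration interval times trip count, plus a fixed overhead. A minimal sketch of that arithmetic, using a hypothetical helper name `pipelined_cycles`:

```c
/*
 * Cycle count of a software-pipelined loop:
 *   ii       : iteration interval in cycles
 *   trips    : number of loop iterations executed
 *   overhead : cycles spent outside the steady-state loop
 */
int pipelined_cycles(int ii, int trips, int overhead)
{
    return ii * trips + overhead;
}
```

Unrolling raises the iteration interval from 2 to 3 but halves the trip count from 32 to 16, which is why the unrolled version comes out ahead.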
6.9 Live-Too-Long Issues


When the result of a parent instruction is live longer than the minimum iteration 
interval of a loop, you have a live-too-long problem. Because each instruction 
executes every iteration interval cycle, the next iteration of that parent over- 
writes the register with a new value before the child can read it. Section 6.5.6.1, 
Resource Conflicts, on page 6-61 showed how to solve this problem simply 
by moving the parent to a later cycle. This is not always a valid solution. 


6.9.1 C Code With Live-Too-Long Problem 


Example 6-52 shows C code with a live-too-long problem that cannot be 
solved by rescheduling the parent instruction. Although it is not obvious from 
the C code, the dependency graph in Figure 6-19 on page 6-100 shows a split- 
join path that causes this live-too-long problem. 


Example 6-52. Live-Too-Long C Code


int live_long(short a[], short b[], short c, short d, short e)
{
    int i,sum0,sum1,sum,a0,a2,a3,b0,b2,b3;
    short a1,b1;

    sum0 = 0;
    sum1 = 0;
    for (i=0; i<100; i++) {
        a0 = a[i] * c;
        a1 = a0 >> 15;
        a2 = a1 * d;
        a3 = a2 + a0;
        sum0 += a3;
        b0 = b[i] * c;
        b1 = b0 >> 15;
        b2 = b1 * e;
        b3 = b2 + b0;
        sum1 += b3;
    }

    sum = sum0 + sum1;

    return(sum);
}
6.9.2 Translating C Code to Linear Assembly


Example 6-53 shows the assembly instructions that execute the inner loop in 
Example 6-52. 


Example 6-53. Linear Assembly for Live-Too-Long Inner Loop


        LDH     *aptr++,ai          ; load ai from memory
        LDH     *bptr++,bi          ; load bi from memory
        MPY     ai,c,a0             ; a0 = ai * c
        SHR     a0,15,a1            ; a1 = a0 >> 15
        MPY     a1,d,a2             ; a2 = a1 * d
        ADD     a2,a0,a3            ; a3 = a2 + a0
        ADD     sum0,a3,sum0        ; sum0 += a3
        MPY     bi,c,b0             ; b0 = bi * c
        SHR     b0,15,b1            ; b1 = b0 >> 15
        MPY     b1,e,b2             ; b2 = b1 * e
        ADD     b2,b0,b3            ; b3 = b2 + b0
        ADD     sum1,b3,sum1        ; sum1 += b3
[cntr]  SUB     cntr,1,cntr         ; decrement loop counter
[cntr]  B       LOOP                ; branch to loop


6.9.3 Drawing a Dependency Graph 


Figure 6-19 shows the dependency graph for the live-too-long code. This 
algorithm includes three separate and independent graphs. Two of the inde- 
pendent graphs have split-join paths: from a0 to a3 and from b0 to b3. 
Figure 6-19. Dependency Graph of Live-Too-Long Code


6.9.4 Determining the Minimum Iteration Interval


Table 6—22 shows the functional unit resources for the loop. Based on the re- 
source usage, the minimum iteration interval is 2 for the following reasons: 


- No specific resource is used more than twice, implying a minimum
  iteration interval of 2.
- A total of five non-.M units on each side also implies a minimum iteration
  interval of 2, because three non-.M units can be used on a side during each
  cycle.


Table 6-22. Resource Table for Live-Too-Long Code


(a) A side

Unit(s)               Instructions      Total/Unit
.M1                   MPY               1
.S1                   B and SHR         2
.D1                   LDH               1
.L1, .S1, or .D1      2 ADDs            2
Total non-.M units                      5


(b) B side

Unit(s)               Instructions      Total/Unit
.M2                   MPY               1
.S2                   SHR               1
.D2                   LDH               1
.L2, .S2, or .D2      2 ADDs and SUB    3
Total non-.M units                      5


However, the minimum iteration interval is determined by both resources and 
data dependency. A loop carry path determined the minimum iteration interval 
of the IIR filter in section 6.6, Loop Carry Paths, on page 6-74. In this example, 
a live-too-long problem determines the minimum iteration interval. 


6.9.4.1 Split-Join-Path Problems 


In Figure 6-19, the two split-join paths from a0 to a3 and from b0 to b3 create 
the live-too-long problem. Because the ADD a3 instruction cannot be sched- 
uled until the SHR a1 and MPY a2 instructions finish, a0 must be live for at least 
four cycles. For example: 


- If MPY a0 is scheduled on cycle 5, then the earliest SHR a1 can be
  scheduled is cycle 7.
- The earliest MPY a2 can be scheduled is cycle 8.
- The earliest ADD a3 can be scheduled is cycle 10.


Because a0 is written at the end of cycle 6, it must be live from cycle 7 to
cycle 10, or four cycles. No value can be live longer than the minimum iteration 
interval, because the next iteration of the loop will overwrite that value before 
the current iteration can read the value. Therefore, if the value has to be live 
for four cycles, the minimum iteration interval must be at least 4. A minimum 
iteration interval of 4 means that the loop executes at half the performance that 
it could based on available resources. 
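The lifetime argument above can be written out as a small calculation. The helper names `live_cycles` and `min_ii_from_lifetime` are hypothetical, and the sketch assumes a result first becomes readable on cycle issue + latency (2 for MPY, since its result is written at the end of the cycle after issue).

```c
/*
 * Number of cycles a result must stay in its register.
 *   issue_cycle     : cycle on which the producing instruction is scheduled
 *   latency         : cycles until the result can first be read
 *   last_read_cycle : cycle of the last instruction that reads the result
 */
int live_cycles(int issue_cycle, int latency, int last_read_cycle)
{
    int first_live = issue_cycle + latency;
    return last_read_cycle - first_live + 1;
}

/* A value may not be live longer than the iteration interval. */
int min_ii_from_lifetime(int resource_min_ii, int longest_live)
{
    return longest_live > resource_min_ii ? longest_live : resource_min_ii;
}
```

With MPY a0 on cycle 5 and ADD a3 on cycle 10, `live_cycles(5, 2, 10)` is 4, so the resource bound of 2 is raised to 4.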


6.9.4.2 Unrolling the Loop 


One way to solve this problem is to unroll the loop, so that you are doing twice 
as much work in each iteration. After unrolling, the minimum iteration interval 
is 4, based on both the resources and the data dependencies of the split-join 
path. Although unrolling the loop allows you to achieve the highest possible 
loop throughput, unrolling the loop does increase the code size. 


6.9.4.3 Inserting Moves


Another solution to the live-too-long problem is to break up the lifetime of a0 
and b0 by inserting move (MV) instructions. The MV instruction breaks up the 
left path of the split-join path into two smaller pieces. 


6.9.4.4 Drawing a New Dependency Graph 
Figure 6-20 shows the new dependency graph with the MV instructions. Now
the left paths of the split-join paths are broken into two pieces. Each value, a0 
and a0’, can be live for minimum iteration interval cycles. If MPY a0 is sched- 
uled on cycle 5 and ADD a3 is scheduled on cycle 10, you can achieve a mini- 
mum iteration interval of 2 by scheduling MV a0’ on cycle 8. Then a0 is live on 
cycles 7 and 8, and a0’ is live on cycles 9 and 10. Because no values are live 
more than two cycles, the minimum iteration interval for this graph is 2. 
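As a rough sketch of the same reasoning, the number of MV instructions needed to split one lifetime, and the cycle for the first MV, can be computed as below. The names `moves_needed` and `first_mv_cycle` are hypothetical, and the sketch ignores the functional units the extra MV instructions consume.

```c
/*
 * Number of MV instructions needed so that no copy of a value is live
 * longer than the iteration interval ii.
 */
int moves_needed(int live, int ii)
{
    int segments = (live + ii - 1) / ii;    /* ceil(live / ii) */
    return segments - 1;
}

/*
 * Cycle on which the first MV is scheduled: the last cycle on which the
 * original value is still allowed to be live.
 */
int first_mv_cycle(int first_live_cycle, int ii)
{
    return first_live_cycle + ii - 1;
}
```

For a0, live from cycle 7 through cycle 10 with a target interval of 2, one MV on cycle 8 is enough, which matches the schedule described above.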
Figure 6-20. Dependency Graph of Live-Too-Long Code (Split-Join Path Resolved)

(Panels: A side, B side. Each side contains the nodes LDH, MPY, SHR, MV, MPY,
ADD, and ADD.)

6.9.5 Linear Assembly Resource Allocation 


Example 6—54 shows the linear assembly code with the functional units as- 
signed. The choice of units for the ADDs and SUB is flexible and represents 
one of a number of possibilities. One goal is to ensure that no functional unit 
is used more than the minimum iteration interval, or two times. 


The two 2X paths and one 1X path are required because the values c, d, and 
e reside on the side opposite from the instruction that is reading them. If these 
values had created a bottleneck of resources and caused the minimum itera- 
tion interval to increase, c, d, and e could have been loaded into the opposite 
register file outside the loop to eliminate the cross path. 
Example 6-54. Linear Assembly for Full Live-Too-Long Code


        .global _live_long
_live_long: .cproc a, b, c, d, e
        .reg    ai, bi, sum0, sum1, sum
        .reg    a0p, a_0, a_1, a_2, a_3, b_0, b0p, b_1, b_2, b_3, cntr

        MVK     100,cntr                ; cntr = 100
        ZERO    sum0                    ; sum0 = 0
        ZERO    sum1                    ; sum1 = 0

LOOP:   .trip 100
        LDH   .D1   *a++,ai             ; load ai from memory
        LDH   .D2   *b++,bi             ; load bi from memory
        MPY   .M1   ai,c,a_0            ; a0 = ai * c
        SHR   .S1   a_0,15,a_1          ; a1 = a0 >> 15
        MPY   .M1X  a_1,d,a_2           ; a2 = a1 * d
        MV    .D1   a_0,a0p             ; save a0 across iterations
        ADD   .L1   a_2,a0p,a_3         ; a3 = a2 + a0
        ADD   .L1   sum0,a_3,sum0       ; sum0 += a3
        MPY   .M2X  bi,c,b_0            ; b0 = bi * c
        SHR   .S2   b_0,15,b_1          ; b1 = b0 >> 15
        MPY   .M2X  b_1,e,b_2           ; b2 = b1 * e
        MV    .D2   b_0,b0p             ; save b0 across iterations
        ADD   .L2   b_2,b0p,b_3         ; b3 = b2 + b0
        ADD   .L2   sum1,b_3,sum1       ; sum1 += b3
[cntr]  SUB   .S2   cntr,1,cntr         ; decrement loop counter
[cntr]  B     .S1   LOOP                ; branch to loop

        ADD         sum0,sum1,sum       ; Add sumi and sumi+1 for ret value

        .return sum
        .endproc
6.9.6 Final Assembly With Move Instructions


Example 6-55 shows the final assembly code after software pipelining. The
performance of this loop is 212 cycles (2 x 100 + 11 + 1).


Example 6-55. Assembly Code for Live-Too-Long With Move Instructions


        LDH   .D1   *A4++,A0        ; load ai from memory
||      LDH   .D2   *B4++,B0        ; load bi from memory

        MVK   .S2   100,B2          ; set up loop counter

        LDH   .D1   *A4++,A0        ;* load ai from memory
||      LDH   .D2   *B4++,B0        ;* load bi from memory

        ZERO  .S1   A1              ; zero out accumulator
||      ZERO  .S2   B1              ; zero out accumulator

        LDH   .D1   *A4++,A0        ;** load ai from memory
||      LDH   .D2   *B4++,B0        ;** load bi from memory

   [B2] SUB   .S2   B2,1,B2         ; decrement loop counter

        MPY   .M1   A0,A6,A3        ; a0 = ai * c
||      MPY   .M2X  B0,A6,B10       ; b0 = bi * c
||      LDH   .D1   *A4++,A0        ;*** load ai from memory
||      LDH   .D2   *B4++,B0        ;*** load bi from memory

   [B2] SUB   .S2   B2,1,B2         ; decrement loop counter
|| [B2] B     .S1   LOOP            ; branch to loop

        SHR   .S1   A3,15,A5        ; a1 = a0 >> 15
||      SHR   .S2   B10,15,B5       ; b1 = b0 >> 15
||      MPY   .M1   A0,A6,A3        ;* a0 = ai * c
||      MPY   .M2X  B0,A6,B10       ;* b0 = bi * c
||      LDH   .D1   *A4++,A0        ;**** load ai from memory
||      LDH   .D2   *B4++,B0        ;**** load bi from memory

        MPY   .M1X  A5,B6,A7        ; a2 = a1 * d
||      MV    .D1   A3,A2           ; save a0 across iterations
||      MPY   .M2X  B5,A8,B7        ; b2 = b1 * e
||      MV    .D2   B10,B8          ; save b0 across iterations
|| [B2] SUB   .S2   B2,1,B2         ;* decrement loop counter
|| [B2] B     .S1   LOOP            ;* branch to loop

        SHR   .S1   A3,15,A5        ;* a1 = a0 >> 15
||      SHR   .S2   B10,15,B5       ;* b1 = b0 >> 15
||      MPY   .M1   A0,A6,A3        ;** a0 = ai * c
||      MPY   .M2X  B0,A6,B10       ;** b0 = bi * c
||      LDH   .D1   *A4++,A0        ;***** load ai from memory
||      LDH   .D2   *B4++,B0        ;***** load bi from memory

LOOP:
        ADD   .L1   A7,A2,A9        ;* a3 = a2 + a0
||      ADD   .L2   B7,B8,B9        ;* b3 = b2 + b0
||      MPY   .M1X  A5,B6,A7        ;* a2 = a1 * d
||      MV    .D1   A3,A2           ;* save a0 across iterations
||      MPY   .M2X  B5,A8,B7        ;* b2 = b1 * e
||      MV    .D2   B10,B8          ;* save b0 across iterations
|| [B2] SUB   .S2   B2,1,B2         ;** decrement loop counter
|| [B2] B     .S1   LOOP            ;** branch to loop

        ADD   .L1   A1,A9,A1        ; sum0 += a3
||      ADD   .L2   B1,B9,B1        ; sum1 += b3
||      SHR   .S1   A3,15,A5        ;** a1 = a0 >> 15
||      SHR   .S2   B10,15,B5       ;** b1 = b0 >> 15
||      MPY   .M1   A0,A6,A3        ;*** a0 = ai * c
||      MPY   .M2X  B0,A6,B10       ;*** b0 = bi * c
||      LDH   .D1   *A4++,A0        ;****** load ai from memory
||      LDH   .D2   *B4++,B0        ;****** load bi from memory
        ; Branch occurs here

        ADD   .L1X  A1,B1,A4        ; sum = sum0 + sum1
6.10 Redundant Load Elimination


Filter algorithms typically read the same value from memory multiple times and 
are, therefore, prime candidates for optimization by eliminating redundant load 
instructions. Rather than perform a load operation each time a particular value 
is read, you can keep the value in a register and read the register multiple 
times. 


6.10.1 FIR Filter C Code 


Example 6-56 shows C code for a simple FIR filter. There are two memory 
reads (x[i+j] and h[i]) for each multiply. Because the ’C6x can perform only two 
LDHs per cycle, it seems, at first glance, that only one multiply-accumulate per 
cycle is possible. 


Example 6-56. FIR Filter C Code


void fir(short x[], short h[], short y[])
{
    int i, j, sum;

    for (j = 0; j < 100; j++) {
        sum = 0;
        for (i = 0; i < 32; i++)
            sum += x[i+j] * h[i];
        y[j] = sum >> 15;
    }
}

One way to optimize this situation is to perform LDWs instead of LDHs to read
two data values at a time. Although using LDW works for the h array, the x array
presents a different problem because the ’C6x does not allow you to load
values across a word boundary.


For example, on the first outer loop (j = 0), you can read the x-array elements 
(0 and 1, 2 and 3, etc.) as long as elements 0 and 1 are aligned on a 4-byte 
word boundary. However, the second outer loop (j = 1) requires reading x-array 
elements 1 through 32. The LDW operation must load elements that are not 
word-aligned (1 and 2, 3 and 4, etc.). 
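The alignment problem can be checked with a one-line test. In this sketch the function name `ldw_pair_is_aligned` is hypothetical; it assumes 2-byte `short` elements and that an LDW must start on a 4-byte boundary.

```c
#include <stdint.h>

/*
 * Returns 1 if a word load of x[k] and x[k+1] starts on a 4-byte boundary.
 *   x_base : byte address of x[0]
 *   k      : index of the first halfword of the pair
 */
int ldw_pair_is_aligned(uint32_t x_base, int k)
{
    uint32_t addr = x_base + 2u * (uint32_t)k;   /* 2 bytes per short */
    return (addr & 3u) == 0;
}
```

With x[0] word-aligned, the pairs needed for j = 0 start at even indices and pass the test, while the pairs needed for j = 1 start at odd indices and fail it.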


6.10.1.1 Redundant Loads 


In order to achieve two multiply-accumulates per cycle, you must reduce the 
number of LDHs. Because successive outer loops read all the same h-array 
values and almost all of the same x-array values, you can eliminate the redun- 
dant loads by unrolling the inner and outer loops. 


For example, x[1] is needed for the first outer loop (x[j+1] with j = 0) and for the
second outer loop (x[j] with j = 1). You can use a single LDH instruction to load
this value.
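To make the saving concrete, the sketch below counts halfword reads for the straightforward loop and for a loop unrolled by two in both the inner and outer dimensions, where one x value is carried over between inner iterations. The function names are hypothetical and the counts are simple arithmetic, not measurements.

```c
/* Straightforward FIR: two reads (x and h) per multiply-accumulate. */
long loads_simple(int outer_trips, int inner_trips)
{
    return 2L * outer_trips * inner_trips;
}

/*
 * Inner and outer loops unrolled by two: each pass over the inner loop
 * reads one x value up front, then four values (two x, two h) per
 * unrolled inner iteration, and produces two outputs.
 */
long loads_unrolled_2x2(int outer_trips, int inner_trips)
{
    long per_outer_pass = 1 + 4L * (inner_trips / 2);
    return (outer_trips / 2) * per_outer_pass;
}
```

For 100 outputs and 32 taps this is 6400 reads against 3250, for the same 3200 multiply-accumulates.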
6.10.1.2 New FIR Filter C Code


Example 6-57 shows that after eliminating redundant loads, there are four 
memory-read operations for every four multiply-accumulate operations. Now 
the memory accesses no longer limit the performance. 


Example 6-57. FIR Filter C Code With Redundant Load Elimination 


void fir(short x[], short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0,x1,h0,h1;

    for (j = 0; j < 100; j+=2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < 32; i+=2){
            x1 = x[j+i+1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x0 = x[j+i+2];
            h1 = h[i+1];
            sum0 += x1 * h1;
            sum1 += x0 * h1;
        }
        y[j] = sum0 >> 15;
        y[j+1] = sum1 >> 15;
    }
}
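One way to gain confidence in a rewrite like this is to run both versions on the same data and compare the outputs. The sketch below restates the two filters under the hypothetical names `fir_simple` and `fir_rle` so they can coexist in one file; the array sizes follow from the loop bounds (x must hold at least 131 elements).

```c
#define N_OUT  100   /* number of outputs   */
#define N_TAPS 32    /* number of filter taps */

/* Straightforward form: two reads per multiply-accumulate. */
void fir_simple(const short x[], const short h[], short y[])
{
    for (int j = 0; j < N_OUT; j++) {
        int sum = 0;
        for (int i = 0; i < N_TAPS; i++)
            sum += x[i + j] * h[i];
        y[j] = (short)(sum >> 15);
    }
}

/* Redundant loads removed: each x and h value is read once per pass. */
void fir_rle(const short x[], const short h[], short y[])
{
    for (int j = 0; j < N_OUT; j += 2) {
        int sum0 = 0, sum1 = 0;
        short x0 = x[j];
        for (int i = 0; i < N_TAPS; i += 2) {
            short x1 = x[j + i + 1];
            short h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x0 = x[j + i + 2];          /* reused by the next iteration */
            short h1 = h[i + 1];
            sum0 += x1 * h1;
            sum1 += x0 * h1;
        }
        y[j]     = (short)(sum0 >> 15);
        y[j + 1] = (short)(sum1 >> 15);
    }
}
```

Filling x and h with any test pattern and checking that both functions write identical y arrays is a quick regression test before moving on to linear assembly.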
6.10.2 Translating C Code to Linear Assembly


Example 6-58 shows the linear assembly that performs the inner loop of the
FIR filter C code.


Element x0 is read by the MPY p00 before it is loaded by the LDH x0
instruction; x[j] (the first x0) is loaded outside the loop, but successive
even elements are loaded inside the loop.


Example 6-58. Linear Assembly for FIR Inner Loop


        LDH   .D2   *x_1++[2],x1        ; x1 = x[j+i+1]
        LDH   .D1   *h++[2],h0          ; h0 = h[i]
        MPY   .M1   x0,h0,p00           ; x0 * h0
        MPY   .M1X  x1,h0,p10           ; x1 * h0
        ADD   .L1   p00,sum0,sum0       ; sum0 += x0 * h0
        ADD   .L2X  p10,sum1,sum1       ; sum1 += x1 * h0

        LDH   .D1   *x++[2],x0          ; x0 = x[j+i+2]
        LDH   .D2   *h_1++[2],h1        ; h1 = h[i+1]
        MPY   .M2   x1,h1,p01           ; x1 * h1
        MPY   .M2X  x0,h1,p11           ; x0 * h1
        ADD   .L1X  p01,sum0,sum0       ; sum0 += x1 * h1
        ADD   .L2   p11,sum1,sum1       ; sum1 += x0 * h1

[ctr]   SUB   .S2   ctr,1,ctr           ; decrement loop counter
[ctr]   B     .S2   LOOP                ; branch to loop
6.10.3 Drawing a Dependency Graph


Figure 6—21 shows the dependency graph of the FIR filter with redundant load 
elimination. 


Figure 6-21. Dependency Graph of FIR Filter (With Redundant Load Elimination) 
6.10.4 Determining the Minimum Iteration Interval


Table 6-23 shows that the minimum iteration interval is 2. An iteration interval 
of 2 means that two multiply-accumulates are executing per cycle. 


Table 6-23. Resource Table for FIR Filter Code


(a) A side

Unit(s)               Instructions      Total/Unit
.M1                   2 MPYs            2
.S1                                     0
.D1                   2 LDHs            2
.L1, .S1, or .D1      2 ADDs            2
Total non-.M units                      4
1X paths                                2


(b) B side

Unit(s)               Instructions      Total/Unit
.M2                   2 MPYs            2
.S2                   B                 1
.D2                   2 LDHs            2
.L2, .S2, or .D2      2 ADDs and SUB    3
Total non-.M units                      6
2X paths                                2


6.10.5 Linear Assembly Resource Allocation 


Example 6—59 shows the linear assembly with functional units and registers 
assigned. 


Example 6-59. Linear Assembly for Full FIR Code


        .global _fir
_fir:   .cproc  x, h, y
        .reg    x_1, h_1, sum0, sum1, ctr, octr
        .reg    p00, p01, p10, p11, x0, x1, h0, h1, rstx, rsth

        ADD     h,2,h_1                 ; set up pointer to h[1]
        MVK     50,octr                 ; outer loop ctr = 100/2
        MVK     64,rstx                 ; used to rst x pointer each outer loop
        MVK     64,rsth                 ; used to rst h pointer each outer loop

OUTLOOP:
        ADD     x,2,x_1                 ; set up pointer to x[j+1]
        SUB     h_1,2,h                 ; set up pointer to h[0]
        MVK     16,ctr                  ; inner loop ctr = 32/2
        ZERO    sum0                    ; sum0 = 0
        ZERO    sum1                    ; sum1 = 0
[octr]  SUB     octr,1,octr             ; decrement outer loop counter
        LDH   .D1   *x++[2],x0          ; x0 = x[j]

LOOP:   .trip 16
        LDH   .D2   *x_1++[2],x1        ; x1 = x[j+i+1]
        LDH   .D1   *h++[2],h0          ; h0 = h[i]
        MPY   .M1   x0,h0,p00           ; x0 * h0
        MPY   .M1X  x1,h0,p10           ; x1 * h0
        ADD   .L1   p00,sum0,sum0       ; sum0 += x0 * h0
        ADD   .L2X  p10,sum1,sum1       ; sum1 += x1 * h0

        LDH   .D1   *x++[2],x0          ; x0 = x[j+i+2]
        LDH   .D2   *h_1++[2],h1        ; h1 = h[i+1]
        MPY   .M2   x1,h1,p01           ; x1 * h1
        MPY   .M2X  x0,h1,p11           ; x0 * h1
        ADD   .L1X  p01,sum0,sum0       ; sum0 += x1 * h1
        ADD   .L2   p11,sum1,sum1       ; sum1 += x0 * h1

[ctr]   SUB   .S2   ctr,1,ctr           ; decrement loop counter
[ctr]   B     .S2   LOOP                ; branch to loop

        SHR     sum0,15,sum0            ; sum0 >> 15
        SHR     sum1,15,sum1            ; sum1 >> 15
        STH     sum0,*y++               ; y[j] = sum0 >> 15
        STH     sum1,*y++               ; y[j+1] = sum1 >> 15
        SUB     x,rstx,x                ; reset x pointer to x[j]
        SUB     h_1,rsth,h_1            ; reset h pointer to h[0]
[octr]  B       OUTLOOP                 ; branch to outer loop
        .endproc


6.10.6 Final Assembly 


Example 6—60 shows the final assembly for the FIR filter without redundant 
load instructions. At the end of the inner loop is a branch to OUTLOOP that 
executes the next outer loop. The outer loop counter is 50 because iterations 
j and j + 1 execute each time the inner loop is run. The inner loop counter is 
16 because iterations i and i + 1 execute each inner loop iteration.


The cycle count for this nested loop is 2352 cycles: 50 x (16 x 2 + 9 + 6) + 2.


Fifteen cycles are overhead for each outer loop:


- Nine cycles execute the inner loop prolog.
- Six cycles execute the branch to the outer loop.


See section 6.12, Software Pipelining the Outer Loop, on page 6-128 for
information on how to reduce this overhead.
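The nested-loop cycle count can be reproduced with the arithmetic below. The function name `nested_loop_cycles` and its parameter names are hypothetical; the values plugged in are the ones quoted in the text.

```c
/*
 * Cycle count of a software-pipelined inner loop inside an outer loop
 * that is not pipelined.
 *   outer_trips  : outer loop iterations
 *   inner_trips  : inner loop iterations per outer iteration
 *   ii           : inner loop iteration interval
 *   prolog       : cycles of inner loop prolog per outer iteration
 *   outer_branch : cycles spent on the branch back to the outer loop
 *   setup        : one-time cycles before the outer loop starts
 */
int nested_loop_cycles(int outer_trips, int inner_trips, int ii,
                       int prolog, int outer_branch, int setup)
{
    int per_outer = inner_trips * ii + prolog + outer_branch;
    return outer_trips * per_outer + setup;
}
```

Here `nested_loop_cycles(50, 16, 2, 9, 6, 2)` evaluates to 2352; of the 47 cycles per outer iteration, 15 are overhead.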
Example 6-60. Final Assembly Code for FIR Filter With Redundant Load Elimination


        MVK   .S1   50,A2           ; set up outer loop counter

        MVK   .S1   80,A3           ; used to rst x ptr outer loop
||      MVK   .S2   82,B6           ; used to rst h ptr outer loop

OUTLOOP:
        LDH   .D1   *A4++[2],A0     ; x0 = x[j]
||      ADD   .L2X  A4,2,B5         ; set up pointer to x[j+1]
||      ADD   .D2   B4,2,B4         ; set up pointer to h[1]
||      ADD   .L1X  B4,0,A5         ; set up pointer to h[0]
||      MVK   .S2   16,B2           ; set up inner loop counter

   [A2] SUB   .S1   A2,1,A2         ; decrement outer loop counter
||      LDH   .D1   *A5++[2],A1     ; h0 = h[i]
||      LDH   .D2   *B5++[2],B1     ; x1 = x[j+i+1]
||      ZERO  .L1   A9              ; zero out sum0
||      ZERO  .L2   B9              ; zero out sum1

        LDH   .D2   *B4++[2],B0     ; h1 = h[i+1]
||      LDH   .D1   *A4++[2],A0     ; x0 = x[j+i+2]

        LDH   .D1   *A5++[2],A1     ;* h0 = h[i]
||      LDH   .D2   *B5++[2],B1     ;* x1 = x[j+i+1]

   [B2] SUB   .S2   B2,1,B2         ; decrement inner loop counter
||      LDH   .D2   *B4++[2],B0     ;* h1 = h[i+1]
||      LDH   .D1   *A4++[2],A0     ;* x0 = x[j+i+2]

   [B2] B     .S2   LOOP            ; branch to inner loop
||      LDH   .D1   *A5++[2],A1     ;** h0 = h[i]
||      LDH   .D2   *B5++[2],B1     ;** x1 = x[j+i+1]

        MPY   .M1   A0,A1,A7        ; x0 * h0
|| [B2] SUB   .S2   B2,1,B2         ;* decrement inner loop counter
||      LDH   .D2   *B4++[2],B0     ;** h1 = h[i+1]
||      LDH   .D1   *A4++[2],A0     ;** x0 = x[j+i+2]

        MPY   .M2   B1,B0,B7        ; x1 * h1
||      MPY   .M1X  B1,A1,A8        ; x1 * h0
|| [B2] B     .S2   LOOP            ;* branch to inner loop
||      LDH   .D1   *A5++[2],A1     ;*** h0 = h[i]
||      LDH   .D2   *B5++[2],B1     ;*** x1 = x[j+i+1]

        MPY   .M2X  A0,B0,B8        ; x0 * h1
||      MPY   .M1   A0,A1,A7        ;* x0 * h0
|| [B2] SUB   .S2   B2,1,B2         ;** decrement inner loop counter
||      LDH   .D2   *B4++[2],B0     ;*** h1 = h[i+1]
||      LDH   .D1   *A4++[2],A0     ;*** x0 = x[j+i+2]

LOOP:
        ADD   .L2X  A8,B9,B9        ; sum1 += x1 * h0
||      ADD   .L1   A7,A9,A9        ; sum0 += x0 * h0
||      MPY   .M2   B1,B0,B7        ;* x1 * h1
||      MPY   .M1X  B1,A1,A8        ;* x1 * h0
|| [B2] B     .S2   LOOP            ;** branch to inner loop
||      LDH   .D1   *A5++[2],A1     ;**** h0 = h[i]
||      LDH   .D2   *B5++[2],B1     ;**** x1 = x[j+i+1]

        ADD   .L1X  B7,A9,A9        ; sum0 += x1 * h1
||      ADD   .L2   B8,B9,B9        ; sum1 += x0 * h1
||      MPY   .M2X  A0,B0,B8        ;* x0 * h1
||      MPY   .M1   A0,A1,A7        ;** x0 * h0
|| [B2] SUB   .S2   B2,1,B2         ;*** decrement inner loop cntr
||      LDH   .D2   *B4++[2],B0     ;**** h1 = h[i+1]
||      LDH   .D1   *A4++[2],A0     ;**** x0 = x[j+i+2]
        ; inner loop branch occurs here

   [A2] B     .S1   OUTLOOP         ; branch to outer loop
||      SUB   .L1   A4,A3,A4        ; reset x pointer to x[j]
||      SUB   .L2   B4,B6,B4        ; reset h pointer to h[0]

        SHR   .S1   A9,15,A9        ; sum0 >> 15
||      SHR   .S2   B9,15,B9        ; sum1 >> 15

        STH   .D1   A9,*A6++        ; y[j] = sum0 >> 15

        STH   .D1   B9,*A6++        ; y[j+1] = sum1 >> 15

        NOP         2               ; branch delay slots
        ; outer loop branch occurs here
6.11 Memory Banks


The internal memory of the ’C6x family varies from device to device. See the 
TMS320C62x/C67x Peripherals Reference Guide to determine the memory 
blocks in your particular device. This section discusses how to write code to 
avoid memory bank conflicts. 


Most ’C6x devices use an interleaved memory bank scheme, as shown in 
Figure 6-22. Each number in the boxes represents a byte address. A load byte 
(LDB) instruction from address 0 loads byte 0 in bank 0. A load halfword (LDH) 
from address 0 loads the halfword value in bytes 0 and 1, which are also in 
bank 0. An LDW from address 0 loads bytes 0 through 3 in banks 0 and 1. 


Because each bank is single-ported memory, only one access to each bank 
is allowed per cycle. Two accesses to a single bank in a given cycle result in 
a memory stall that halts all pipeline operation for one cycle, while the second 
value is read from memory. Two memory operations per cycle are allowed 
without any stall, as long as they do not access the same bank. 
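The bank rules above can be modeled in a few lines. This is only a sketch under stated assumptions: four banks, each two bytes wide, as in the interleaving this section describes, and both accesses in the same memory block. The names `bank_mask` and `bank_conflict` are hypothetical.

```c
#include <stdint.h>

enum { NUM_BANKS = 4, BANK_WIDTH_BYTES = 2 };

/* Bit i of the result is set if the access touches bank i. */
unsigned bank_mask(uint32_t addr, unsigned size_bytes)
{
    unsigned mask = 0;
    for (unsigned b = 0; b < size_bytes; b++)
        mask |= 1u << (((addr + b) / BANK_WIDTH_BYTES) % NUM_BANKS);
    return mask;
}

/* Two accesses issued in the same cycle stall if they share a bank. */
int bank_conflict(uint32_t addr_a, unsigned size_a,
                  uint32_t addr_b, unsigned size_b)
{
    return (bank_mask(addr_a, size_a) & bank_mask(addr_b, size_b)) != 0;
}
```

For example, halfword loads from byte addresses 0 and 8 both land in bank 0 and conflict, while halfword loads from addresses 0 and 2 use banks 0 and 1 and do not.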


Figure 6-22. 4-Bank Interleaved Memory


Bank 0          Bank 1          Bank 2          Bank 3
0     1         2     3         4     5         6     7
8     9         10    11        12    13        14    15
...
8N    8N+1      8N+2  8N+3      8N+4  8N+5      8N+6  8N+7


For devices that have more than one memory block (see Figure 6-23), an
access to bank 0 in one block does not interfere with an access to bank 0 in
another memory block, and no pipeline stall occurs.


Figure 6-23. 4-Bank Interleaved Memory With Two Memory Blocks


                  Bank 0          Bank 1          Bank 2          Bank 3
Memory block 0    0     1         2     3         4     5         6     7
                  8     9         10    11        12    13        14    15
Memory block 1    8M    8M+1      8M+2  8M+3      8M+4  8M+5      8M+6  8M+7


If each array in a loop resides in a separate memory block, the 2-cycle loop 
in Example 6-57 on page 6-108 is sufficient. This section describes a solution 
when two arrays must reside in the same memory block. 


Example 6—61 shows the inner loop from the final assembly in Example 6-60. 
The LDHs from the h array are in parallel with LDHs from the x array. If x[1] is 
on an even halfword (bank 0) and h[0] is on an odd halfword (bank 1), 
Example 6-61 has no memory conflicts. However, if both x[1] and h[0] are on 
an even halfword in memory (bank 0) and they are in the same memory block, 
every cycle incurs a memory pipeline stall and the loop runs at half the speed. 


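Example 6-61. Final Assembly Code for Inner Loop of FIR Filter


LOOP:
        ADD   .L2X  A8,B9,B9        ; sum1 += x1 * h0
||      ADD   .L1   A7,A9,A9        ; sum0 += x0 * h0
||      MPY   .M2   B1,B0,B7        ;* x1 * h1
||      MPY   .M1X  B1,A1,A8        ;* x1 * h0
|| [B2] B     .S2   LOOP            ;** branch to inner loop
||      LDH   .D1   *A5++[2],A1     ;**** h0 = h[i]
||      LDH   .D2   *B5++[2],B1     ;**** x1 = x[j+i+1]

        ADD   .L1X  B7,A9,A9        ; sum0 += x1 * h1
||      ADD   .L2   B8,B9,B9        ; sum1 += x0 * h1
||      MPY   .M2X  A0,B0,B8        ;* x0 * h1
||      MPY   .M1   A0,A1,A7        ;** x0 * h0
|| [B2] SUB   .S2   B2,1,B2         ;*** decrement inner loop cntr
||      LDH   .D2   *B4++[2],B0     ;**** h1 = h[i+1]
||      LDH   .D1   *A4++[2],A0     ;**** x0 = x[j+i+2]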


It is not always possible to fully control how arrays are aligned, especially if one
of the arrays is passed into a function as a pointer and that pointer has different
alignments each time the function is called. One solution to this problem is to
write an FIR filter that avoids memory hits, regardless of the x and h array
alignments.


If accesses to the even and odd elements of an array (h or x) are scheduled
on the same cycle, the accesses are always on adjacent memory banks. Thus,
to write an FIR filter that never has memory hits, even and odd elements of the
same array must be scheduled on the same loop cycle.
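The adjacency claim can be checked directly: neighboring halfword elements are two bytes apart, so they fall in consecutive banks whatever the array's alignment. The sketch assumes four banks that are each one halfword wide; the name `halfword_bank` is hypothetical.

```c
#include <stdint.h>

/*
 * Bank that holds element k of a short array.
 *   base : byte address of element 0 (halfword-aligned)
 *   k    : element index
 */
unsigned halfword_bank(uint32_t base, unsigned k)
{
    return ((base + 2u * k) / 2u) % 4u;
}
```

Because `halfword_bank(base, k + 1)` is always one bank past `halfword_bank(base, k)` (modulo 4), loading an even element and its odd neighbor in the same cycle can never hit the same bank.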
In the case of the FIR filter, scheduling the even and odd elements of the same
array on the same loop cycle cannot be done in a 2-cycle loop, as shown in
Figure 6-24. In this example, a valid 2-cycle software-pipelined loop without
memory constraints is ruled by the following constraints:


- LDH h0 and LDH h1 are on the same loop cycle.
- LDH x0 and LDH x1 are on the same loop cycle.
- MPY p00 must be scheduled three or four cycles after LDH x0, because
  it must read x0 from the previous iteration of LDH x0.
- All other MPYs must be five or six cycles after their LDH parents.
- No MPYs on the same side (A or B) can be on the same loop cycle.


Figure 6-24. Dependency Graph of FIR Filter (With Even and Odd Elements of
Each Array on Same Loop Cycle)

(Panels: A side, B side.)

Note: Numbers in bold represent the cycle the instruction is scheduled on.


The scenario in Figure 6-24 almost works. All nodes satisfy the above
constraints except MPY p10. Because one parent is on cycle 1 (LDH h0) and
another on cycle 0 (LDH x1), the only cycle for MPY p10 is cycle 6. However,
another MPY on the A side is also scheduled on cycle 6 (MPY p00). Other
combinations of cycles for this graph produce similar results.


6.11.2 Unrolled FIR Filter C Code


The main limitation in solving the problem in Figure 6—24 is in scheduling a 2- 
cycle loop, which means that no value can be live more than two cycles. In- 
creasing the iteration interval to 3 decreases performance. A better solution 
is to unroll the inner loop one more time and produce a 4-cycle loop. 


Example 6-62 shows the FIR filter C code after unrolling the inner loop one 
more time. This solution adds to the flexibility of scheduling and allows you to 
write FIR filter code that never has memory hits, regardless of array alignment 
and memory block. 


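Example 6-62. FIR Filter C Code (Unrolled)


void fir(short x[], short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0,x1,x2,x3,h0,h1,h2,h3;

    for (j = 0; j < 100; j+=2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < 32; i+=4){
            x1 = x[j+i+1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x2 = x[j+i+2];
            h1 = h[i+1];
            sum0 += x1 * h1;
            sum1 += x2 * h1;
            x3 = x[j+i+3];
            h2 = h[i+2];
            sum0 += x2 * h2;
            sum1 += x3 * h2;
            x0 = x[j+i+4];
            h3 = h[i+3];
            sum0 += x3 * h3;
            sum1 += x0 * h3;
        }
        y[j] = sum0 >> 15;
        y[j+1] = sum1 >> 15;
    }
}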
6.11.3 Translating C Code to Linear Assembly


Example 6-63 shows the linear assembly for the unrolled inner loop of the FIR
filter C code.


Example 6-63. Linear Assembly for Unrolled FIR Inner Loop


        LDH     *x++,x1             ; x1 = x[j+i+1]
        LDH     *h++,h0             ; h0 = h[i]
        MPY     x0,h0,p00           ; x0 * h0
        MPY     x1,h0,p10           ; x1 * h0
        ADD     p00,sum0,sum0       ; sum0 += x0 * h0
        ADD     p10,sum1,sum1       ; sum1 += x1 * h0

        LDH     *x++,x2             ; x2 = x[j+i+2]
        LDH     *h++,h1             ; h1 = h[i+1]
        MPY     x1,h1,p01           ; x1 * h1
        MPY     x2,h1,p11           ; x2 * h1
        ADD     p01,sum0,sum0       ; sum0 += x1 * h1
        ADD     p11,sum1,sum1       ; sum1 += x2 * h1

        LDH     *x++,x3             ; x3 = x[j+i+3]
        LDH     *h++,h2             ; h2 = h[i+2]
        MPY     x2,h2,p02           ; x2 * h2
        MPY     x3,h2,p12           ; x3 * h2
        ADD     p02,sum0,sum0       ; sum0 += x2 * h2
        ADD     p12,sum1,sum1       ; sum1 += x3 * h2

        LDH     *x++,x0             ; x0 = x[j+i+4]
        LDH     *h++,h3             ; h3 = h[i+3]
        MPY     x3,h3,p03           ; x3 * h3
        MPY     x0,h3,p13           ; x0 * h3
        ADD     p03,sum0,sum0       ; sum0 += x3 * h3
        ADD     p13,sum1,sum1       ; sum1 += x0 * h3

[cntr]  SUB     cntr,1,cntr         ; decrement loop counter
[cntr]  B       LOOP                ; branch to loop
6.11.4 Drawing a Dependency Graph


Figure 6-25 shows the dependency graph of the FIR filter with no memory
hits.


Figure 6-25. Dependency Graph of FIR Filter (With No Memory Hits)

(Panels: A side, B side.)

6.11.5 Linear Assembly for Unrolled FIR Inner Loop With .mptr Directive 


Example 6-64 shows the unrolled FIR inner loop with the .mptr directive. The
.mptr directive allows the assembly optimizer to automatically determine if two
memory operations have a bank conflict by associating memory access
information with a specific pointer register.


If the assembly optimizer determines that two memory operations have a bank
conflict, then it will not schedule them in parallel. The .mptr directive tells the
assembly optimizer that when the specified register is used as a memory
pointer in a load or store instruction, it is initialized to point at a base
location + <offset>, and is incremented a number of times each time through
the loop.


Without the .mptr directives, the loads of x1 and h0 are scheduled in parallel,
and the loads of x2 and h1 are scheduled in parallel. This results in a 50%
chance of a memory conflict on every cycle.


Example 6-64. Linear Assembly for Full Unrolled FIR Filter 


        .global _fir

_fir:   .cproc  x, h, y

        .reg    x_1, h_1, sum0, sum1, ctr, octr
        .reg    p00, p01, p02, p03, p10, p11, p12, p13
        .reg    x0, x1, x2, x3, h0, h1, h2, h3, rstx, rsth

        ADD     h,2,h_1                  ; set up pointer to h[1]
        MVK     50,octr                  ; outer loop ctr = 100/2
        MVK     64,rstx                  ; used to rst x pointer each outer loop
        MVK     64,rsth                  ; used to rst h pointer each outer loop

OUTLOOP:
        ADD     x,2,x_1                  ; set up pointer to x[j+1]
        SUB     h_1,2,h                  ; set up pointer to h[0]
        MVK     8,ctr                    ; inner loop ctr = 32/4
        ZERO    sum0                     ; sum0 = 0
        ZERO    sum1                     ; sum1 = 0
[octr]  SUB     octr,1,octr              ; decrement outer loop counter

        .mptr   x, x+0
        .mptr   x_1, x+2
        .mptr   h, h+0
        .mptr   h_1, h+2

        LDH     .D2     *x++[2],x0       ; x0 = x[j]
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Example 6-64. Linear Assembly for Full Unrolled FIR Filter (Continued) 


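LOOP:   .trip 8

        LDH     .D1     *x_1++[2],x1     ; x1 = x[j+i+1]
        LDH     .D1     *h++[2],h0       ; h0 = h[i]
        MPY     .M1X    x0,h0,p00        ; x0 * h0
        MPY     .M1     x1,h0,p10        ; x1 * h0
        ADD     .L1     p00,sum0,sum0    ; sum0 += x0 * h0
        ADD     .L2X    p10,sum1,sum1    ; sum1 += x1 * h0

        LDH     .D2     *x++[2],x2       ; x2 = x[j+i+2]
        LDH     .D2     *h_1++[2],h1     ; h1 = h[i+1]
        MPY     .M2X    x1,h1,p01        ; x1 * h1
        MPY     .M2     x2,h1,p11        ; x2 * h1
        ADD     .L1X    p01,sum0,sum0    ; sum0 += x1 * h1
        ADD     .L2     p11,sum1,sum1    ; sum1 += x2 * h1

        LDH     .D1     *x_1++[2],x3     ; x3 = x[j+i+3]
        LDH     .D1     *h++[2],h2       ; h2 = h[i+2]
        MPY     .M1X    x2,h2,p02        ; x2 * h2
        MPY     .M1     x3,h2,p12        ; x3 * h2
        ADD     .L1     p02,sum0,sum0    ; sum0 += x2 * h2
        ADD     .L2X    p12,sum1,sum1    ; sum1 += x3 * h2

        LDH     .D2     *x++[2],x0       ; x0 = x[j+i+4]
        LDH     .D2     *h_1++[2],h3     ; h3 = h[i+3]
        MPY     .M2X    x3,h3,p03        ; x3 * h3
        MPY     .M2     x0,h3,p13        ; x0 * h3
        ADD     .L1X    p03,sum0,sum0    ; sum0 += x3 * h3
        ADD     .L2     p13,sum1,sum1    ; sum1 += x0 * h3

[ctr]   SUB     .S2     ctr,1,ctr        ; decrement loop counter
[ctr]   B       .S2     LOOP             ; branch to loop

        SHR     sum0,15,sum0             ; sum0 >> 15
        SHR     sum1,15,sum1             ; sum1 >> 15
        STH     sum0,*y++                ; y[j] = sum0 >> 15
        STH     sum1,*y++                ; y[j+1] = sum1 >> 15
        SUB     x,rstx,x                 ; reset x pointer to x[j]
        SUB     h_1,rsth,h_1             ; reset h pointer to h[0]
[octr]  B       OUTLOOP                  ; branch to outer loop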
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6.11.6 Linear Assembly Resource Allocation 
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As the number of instructions in a loop increases, assigning a specific register 
to every value in the loop becomes increasingly difficult. If 33 instructions in 
a loop each write a value, they cannot each write to a unique register because 
the ’C6x has only 32 registers. As a result, values that are not live on the same 
cycles in the loop must share registers. 


For example, in a 4-cycle loop: 


- If a value is written at the end of cycle 0 and read on cycle 2 of the loop,
  it is live for two cycles (cycles 1 and 2 of the loop).


- If another value is written at the end of cycle 2 and read on cycle 0 (the next
  iteration) of the loop, it is also live for two cycles (cycles 3 and 0 of the loop).


Because both of these values are not live on the same cycles, they can occupy 
the same register. Only after scheduling these instructions and their children 
do you know that they can occupy the same register. 


Register allocation is not complicated but can be tedious when done by hand. 
Each value has to be analyzed for its lifetime and then appropriately combined 
with other values not live on the same cycles in the loop. The assembly opti- 
mizer handles this automatically after it software pipelines the loop. See the 
TMS320C6x Optimizing C Compiler User’s Guide for more information. 
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6.11.7 Determining the Minimum Iteration Interval 


Based on Table 6-24, the minimum iteration interval for the FIR filter with no
memory hits should be 4. An iteration interval of 4 means that two
multiply-accumulates still execute per cycle (eight multiplies every four cycles).


Table 6-24. Resource Table for FIR Filter Code 


(a) A side                                        (b) B side

Unit(s)             Instructions  Total/Unit      Unit(s)             Instructions     Total/Unit
.M1                 4 MPYs        4               .M2                 4 MPYs           4
.S1                               0               .S2                 B                1
.D1                 4 LDHs        4               .D2                 4 LDHs           4
.L1, .S1, or .D1    4 ADDs        4               .L2, .S2, or .D2    4 ADDs and SUB   5
Total non-.M units                8               Total non-.M units                   10
1X paths                          4               2X paths                             4


6.11.8 Final Assembly 


Example 6-65 shows the final assembly for the FIR filter with redundant load
elimination and no memory hits. At the end of the inner loop, there is a branch
to OUTLOOP to execute the next outer loop. The outer loop counter is set to
50 because iterations j and j+1 are executing each time the inner loop is run.
The inner loop counter is set to 8 because iterations i, i+1, i+2, and i+3 are
executing each inner loop iteration.
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6.11.9 Comparing Performance 


The cycle count for this nested loop is 2402 cycles. There is a rather large 
outer-loop overhead for executing the branch to the outer loop (6 cycles) and 
the inner loop prolog (10 cycles). Section 6.12 addresses how to reduce this 
overhead by software pipelining the outer loop. 


Table 6-25. Comparison of FIR Filter Code 


Code Example                                                     Cycles                    Cycle Count

Example 6-60   FIR with redundant load elimination               50 (16 x 2 + 9 + 6) + 2   2352

Example 6-65   FIR with redundant load elimination and no        50 (8 x 4 + 10 + 6) + 2   2402
               memory hits
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Example 6-65. Final Assembly Code for FIR Filter With Redundant Load Elimination 


and No Memory Hits 
         MVK     .S1     50,A2            ; set up outer loop counter

         MVK     .S1     62,A3            ; used to rst x pointer outloop
||       MVK     .S2     64,B10           ; used to rst h pointer outloop

OUTLOOP:
         LDH     .D1     *A4++,B5         ; x0 = x[j]
||       ADD     .L2X    A4,4,B1          ; set up pointer to x[j+2]
||       ADD     .L1X    B4,2,A8          ; set up pointer to h[1]
||       MVK     .S2     8,B2             ; set up inner loop counter
|| [A2]  SUB     .S1     A2,1,A2          ; decrement outer loop counter

         LDH     .D2     *B1++[2],B0      ; x2 = x[j+i+2]
||       LDH     .D1     *A4++[2],A0      ; x1 = x[j+i+1]
||       ZERO    .L1     A9               ; zero out sum0
||       ZERO    .L2     B9               ; zero out sum1

         LDH     .D1     *A8++[2],B6      ; h1 = h[i+1]
||       LDH     .D2     *B4++[2],A1      ; h0 = h[i]

         LDH     .D1     *A4++[2],A5      ; x3 = x[j+i+3]
||       LDH     .D2     *B1++[2],B5      ; x0 = x[j+i+4]

         LDH     .D2     *B4++[2],A7      ; h2 = h[i+2]
||       LDH     .D1     *A8++[2],B8      ; h3 = h[i+3]
|| [B2]  SUB     .S2     B2,1,B2          ; decrement loop counter

         LDH     .D2     *B1++[2],B0      ;* x2 = x[j+i+2]
||       LDH     .D1     *A4++[2],A0      ;* x1 = x[j+i+1]

         LDH     .D1     *A8++[2],B6      ;* h1 = h[i+1]
||       LDH     .D2     *B4++[2],A1      ;* h0 = h[i]

         MPY     .M1X    B5,A1,A0         ; x0 * h0
||       MPY     .M2X    A0,B6,B6         ; x1 * h1
||       LDH     .D1     *A4++[2],A5      ;* x3 = x[j+i+3]
||       LDH     .D2     *B1++[2],B5      ;* x0 = x[j+i+4]

   [B2]  B       .S1     LOOP             ; branch to loop
||       MPY     .M2     B0,B6,B7         ; x2 * h1
||       MPY     .M1     A0,A1,A1         ; x1 * h0
||       LDH     .D2     *B4++[2],A7      ;* h2 = h[i+2]
||       LDH     .D1     *A8++[2],B8      ;* h3 = h[i+3]
|| [B2]  SUB     .S2     B2,1,B2          ;* decrement loop counter

         ADD     .L1     A0,A9,A9         ; sum0 += x0 * h0
||       MPY     .M2X    A5,B8,B8         ; x3 * h3
||       MPY     .M1X    B0,A7,A5         ; x2 * h2
||       LDH     .D2     *B1++[2],B0      ;** x2 = x[j+i+2]
||       LDH     .D1     *A4++[2],A0      ;** x1 = x[j+i+1]


LOOP:
         ADD     .L2X    A1,B9,B9         ; sum1 += x1 * h0
||       ADD     .L1X    B6,A9,A9         ; sum0 += x1 * h1
||       MPY     .M2     B5,B8,B7         ; x0 * h3
||       MPY     .M1     A5,A7,A7         ; x3 * h2
||       LDH     .D1     *A8++[2],B6      ;** h1 = h[i+1]
||       LDH     .D2     *B4++[2],A1      ;** h0 = h[i]

         ADD     .L2     B7,B9,B9         ; sum1 += x2 * h1
||       ADD     .L1     A5,A9,A9         ; sum0 += x2 * h2
||       MPY     .M1X    B5,A1,A0         ;* x0 * h0
||       MPY     .M2X    A0,B6,B6         ;* x1 * h1
||       LDH     .D1     *A4++[2],A5      ;** x3 = x[j+i+3]
||       LDH     .D2     *B1++[2],B5      ;** x0 = x[j+i+4]

         ADD     .L2X    A7,B9,B9         ; sum1 += x3 * h2
||       ADD     .L1X    B8,A9,A9         ; sum0 += x3 * h3
|| [B2]  B       .S1     LOOP             ;* branch to loop
||       MPY     .M2     B0,B6,B7         ;* x2 * h1
||       MPY     .M1     A0,A1,A1         ;* x1 * h0
||       LDH     .D2     *B4++[2],A7      ;** h2 = h[i+2]
||       LDH     .D1     *A8++[2],B8      ;** h3 = h[i+3]
|| [B2]  SUB     .S2     B2,1,B2          ;** decrement loop counter

         ADD     .L2     B7,B9,B9         ; sum1 += x0 * h3
||       ADD     .L1     A0,A9,A9         ;* sum0 += x0 * h0
||       MPY     .M2X    A5,B8,B8         ;* x3 * h3
||       MPY     .M1X    B0,A7,A5         ;* x2 * h2
||       LDH     .D2     *B1++[2],B0      ;*** x2 = x[j+i+2]
||       LDH     .D1     *A4++[2],A0      ;*** x1 = x[j+i+1]
         ; inner loop branch occurs here

   [A2]  B       .S2     OUTLOOP          ; branch to outer loop
||       SUB     .L1     A4,A3,A4         ; reset x pointer to x[j]
||       SUB     .L2     B4,B10,B4        ; reset h pointer to h[0]
||       SUB     .S1     A9,A0,A9         ; sum0 -= x0*h0 (eliminate add)

         SHR     .S1     A9,15,A9         ; sum0 >> 15
||       SHR     .S2     B9,15,B9         ; sum1 >> 15

         STH     .D1     A9,*A6++         ; y[j] = sum0 >> 15

         STH     .D1     B9,*A6++         ; y[j+1] = sum1 >> 15

         NOP             2                ; branch delay slots
         ; outer loop branch occurs here
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6.12 Software Pipelining the Outer Loop 


In previous examples, software pipelining has always affected the inner loop.
However, software pipelining works equally well with the outer loop in a nested
loop.


6.12.1 Unrolled FIR Filter C Code 


Example 6-66 shows the FIR filter C code after unrolling the inner loop (identi- 
cal to Example 6-62 on page 6-119). 


Example 6-66. Unrolled FIR Filter C Code 


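void fir(short x[], short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0,x1,x2,x3,h0,h1,h2,h3;

    for (j = 0; j < 100; j+=2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < 32; i+=4){
            x1 = x[j+i+1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x2 = x[j+i+2];
            h1 = h[i+1];
            sum0 += x1 * h1;
            sum1 += x2 * h1;
            x3 = x[j+i+3];
            h2 = h[i+2];
            sum0 += x2 * h2;
            sum1 += x3 * h2;
            x0 = x[j+i+4];
            h3 = h[i+3];
            sum0 += x3 * h3;
            sum1 += x0 * h3;
        }
        y[j] = sum0 >> 15;
        y[j+1] = sum1 >> 15;
    }
}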
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6.12.2 Making the Outer Loop Parallel With the Inner Loop Epilog and Prolog 


The final assembly code for the FIR filter with redundant load elimination and
no memory hits (shown in Example 6-65 on page 6-126) contained 16 cycles
of overhead to call the inner loop every time: ten cycles for the loop prolog and
six cycles for the outer loop instructions and branching to the outer loop.


Most of this overhead can be reduced as follows:


- Put the outer loop and branch instructions in parallel with the prolog.
- Create an epilog to the inner loop.
- Put some outer loop instructions in parallel with the inner-loop epilog.

6.12.3 Final Assembly 


Example 6-67 shows the final assembly for the FIR filter with a software-pipe- 
lined outer loop. Below the inner loop (starting on page 6-131), each instruc- 
tion is marked in the comments with an e, p, or o for instructions relating to epi- 
log, prolog, or outer loop, respectively. 


The inner loop is now only run seven times, because the eighth iteration is 
done in the epilog in parallel with the prolog of the next inner loop and the outer 
loop instructions. 
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Example 6-67. Final Assembly Code for FIR Filter With Redundant Load Elimination and 
No Memory Hits With Outer Loop Software-Pipelined 


         MVK     .S1     50,A2            ; set up outer loop counter
||       STW     .D2     B11,*B15--       ; push register

         MVK     .S1     74,A3            ; used to rst x ptr outer loop
||       MVK     .S2     72,B10           ; used to rst h ptr outer loop
||       ADD     .L2X    A6,2,B11         ; set up pointer to y[1]

         LDH     .D1     *A4++,B8         ; x0 = x[j]
||       ADD     .L2X    A4,4,B1          ; set up pointer to x[j+2]
||       ADD     .L1X    B4,2,A8          ; set up pointer to h[1]
||       MVK     .S2     8,B2             ; set up inner loop counter
|| [A2]  SUB     .S1     A2,1,A2          ; decrement outer loop counter

         LDH     .D2     *B1++[2],B0      ; x2 = x[j+i+2]
||       LDH     .D1     *A4++[2],A0      ; x1 = x[j+i+1]
||       ZERO    .L1     A9               ; zero out sum0
||       ZERO    .L2     B9               ; zero out sum1

         LDH     .D1     *A8++[2],B6      ; h1 = h[i+1]
||       LDH     .D2     *B4++[2],A1      ; h0 = h[i]

         LDH     .D1     *A4++[2],A5      ; x3 = x[j+i+3]
||       LDH     .D2     *B1++[2],B5      ; x0 = x[j+i+4]

OUTLOOP:
         LDH     .D2     *B4++[2],A7      ; h2 = h[i+2]
||       LDH     .D1     *A8++[2],B8      ; h3 = h[i+3]
|| [B2]  SUB     .S2     B2,1,B2          ; decrement loop counter

         LDH     .D2     *B1++[2],B0      ;* x2 = x[j+i+2]
||       LDH     .D1     *A4++[2],A0      ;* x1 = x[j+i+1]

         LDH     .D1     *A8++[2],B6      ;* h1 = h[i+1]
||       LDH     .D2     *B4++[2],A1      ;* h0 = h[i]

         MPY     .M1X    B8,A1,A0         ; x0 * h0
||       MPY     .M2X    A0,B6,B6         ; x1 * h1
||       LDH     .D1     *A4++[2],A5      ;* x3 = x[j+i+3]
||       LDH     .D2     *B1++[2],B5      ;* x0 = x[j+i+4]

   [B2]  B       .S1     LOOP             ; branch to loop
||       MPY     .M2     B0,B6,B7         ; x2 * h1
||       MPY     .M1     A0,A1,A1         ; x1 * h0
||       LDH     .D2     *B4++[2],A7      ;* h2 = h[i+2]
||       LDH     .D1     *A8++[2],B8      ;* h3 = h[i+3]
|| [B2]  SUB     .S2     B2,1,B2          ;* decrement loop counter
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Example 6-67. Final Assembly Code for FIR Filter With Redundant Load Elimination and 
No Memory Hits With Outer Loop Software-Pipelined (Continued) 


         ADD     .L1     A0,A9,A9         ; sum0 += x0 * h0
||       MPY     .M2X    A5,B8,B8         ; x3 * h3
||       MPY     .M1X    B0,A7,A5         ; x2 * h2
||       LDH     .D2     *B1++[2],B0      ;** x2 = x[j+i+2]
||       LDH     .D1     *A4++[2],A0      ;** x1 = x[j+i+1]

LOOP:
         ADD     .L2X    A1,B9,B9         ; sum1 += x1 * h0
||       ADD     .L1X    B6,A9,A9         ; sum0 += x1 * h1
||       MPY     .M2     B5,B8,B7         ; x0 * h3
||       MPY     .M1     A5,A7,A7         ; x3 * h2
||       LDH     .D1     *A8++[2],B6      ;** h1 = h[i+1]
||       LDH     .D2     *B4++[2],A1      ;** h0 = h[i]

         ADD     .L2     B7,B9,B9         ; sum1 += x2 * h1
||       ADD     .L1     A5,A9,A9         ; sum0 += x2 * h2
||       MPY     .M1X    B5,A1,A0         ;* x0 * h0
||       MPY     .M2X    A0,B6,B6         ;* x1 * h1
||       LDH     .D1     *A4++[2],A5      ;** x3 = x[j+i+3]
||       LDH     .D2     *B1++[2],B5      ;** x0 = x[j+i+4]

         ADD     .L2X    A7,B9,B9         ; sum1 += x3 * h2
||       ADD     .L1X    B8,A9,A9         ; sum0 += x3 * h3
|| [B2]  B       .S1     LOOP             ;* branch to loop
||       MPY     .M2     B0,B6,B7         ;* x2 * h1
||       MPY     .M1     A0,A1,A1         ;* x1 * h0
||       LDH     .D2     *B4++[2],A7      ;** h2 = h[i+2]
||       LDH     .D1     *A8++[2],B8      ;** h3 = h[i+3]
|| [B2]  SUB     .S2     B2,1,B2          ;** decrement loop counter

         ADD     .L2     B7,B9,B9         ; sum1 += x0 * h3
||       ADD     .L1     A0,A9,A9         ;* sum0 += x0 * h0
||       MPY     .M2X    A5,B8,B8         ;* x3 * h3
||       MPY     .M1X    B0,A7,A5         ;* x2 * h2
||       LDH     .D2     *B1++[2],B0      ;*** x2 = x[j+i+2]
||       LDH     .D1     *A4++[2],A0      ;*** x1 = x[j+i+1]
         ; inner loop branch occurs here

         ADD     .L2X    A1,B9,B9         ;e sum1 += x1 * h0
||       ADD     .L1X    B6,A9,A9         ;e sum0 += x1 * h1
||       MPY     .M2     B5,B8,B7         ;e x0 * h3
||       MPY     .M1     A5,A7,A7         ;e x3 * h2
||       SUB     .D1     A4,A3,A4         ;o reset x pointer to x[j]
||       SUB     .D2     B4,B10,B4        ;o reset h pointer to h[0]
|| [A2]  B       .S1     OUTLOOP          ;o branch to outer loop
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Example 6-67. Final Assembly Code for FIR Filter With Redundant Load Elimination and 
No Memory Hits With Outer Loop Software-Pipelined (Continued) 


         ADD     .D2     B7,B9,B9         ;e sum1 += x2 * h1
||       ADD     .L1     A5,A9,A9         ;e sum0 += x2 * h2
||       LDH     .D1     *A4++,B8         ;p x0 = x[j]
||       ADD     .L2X    A4,4,B1          ;o set up pointer to x[j+2]
||       ADD     .S1X    B4,2,A8          ;o set up pointer to h[1]
||       MVK     .S2     8,B2             ;o set up inner loop counter

         ADD     .L2X    A7,B9,B9         ;e sum1 += x3 * h2
||       ADD     .L1X    B8,A9,A9         ;e sum0 += x3 * h3
||       LDH     .D2     *B1++[2],B0      ;p x2 = x[j+i+2]
||       LDH     .D1     *A4++[2],A0      ;p x1 = x[j+i+1]
|| [A2]  SUB     .S1     A2,1,A2          ;o decrement outer loop counter

         ADD     .L2     B7,B9,B9         ;e sum1 += x0 * h3
||       SHR     .S1     A9,15,A9         ;e sum0 >> 15
||       LDH     .D1     *A8++[2],B6      ;p h1 = h[i+1]
||       LDH     .D2     *B4++[2],A1      ;p h0 = h[i]

         SHR     .S2     B9,15,B9         ;e sum1 >> 15
||       LDH     .D1     *A4++[2],A5      ;p x3 = x[j+i+3]
||       LDH     .D2     *B1++[2],B5      ;p x0 = x[j+i+4]

         STH     .D1     A9,*A6++[2]      ;e y[j] = sum0 >> 15
||       STH     .D2     B9,*B11++[2]     ;e y[j+1] = sum1 >> 15
||       ZERO    .S1     A9               ;o zero out sum0
||       ZERO    .S2     B9               ;o zero out sum1
         ; outer loop branch occurs here


6.12.4 Comparing Performance 


The improved cycle count for this loop is 2006 cycles: 50 ((7 x 4) + 6 + 6) + 6. The
outer-loop overhead for this loop has been reduced from 16 cycles to 8 (6 + 6 - 4);
the -4 accounts for the inner loop running one fewer iteration (seven instead
of eight).


Table 6-26. Comparison of FIR Filter Code 


Code Example                                                     Cycles                    Cycle Count

Example 6-60   FIR with redundant load elimination               50 (16 x 2 + 9 + 6) + 2   2352

Example 6-65   FIR with redundant load elimination and no        50 (8 x 4 + 10 + 6) + 2   2402
               memory hits

Example 6-67   FIR with redundant load elimination and no        50 (7 x 4 + 6 + 6) + 6    2006
               memory hits with outer loop software-pipelined
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6.13 Outer Loop Conditionally Executed With Inner Loop 


Software pipelining the outer loop improved the outer loop overhead in the 
previous example from 16 cycles to 8 cycles. Executing the outer loop condi- 
tionally and in parallel with the inner loop eliminates the overhead entirely. 


6.13.1 Unrolled FIR Filter C Code 


Example 6-68 shows the same unrolled FIR filter C code that was used in the
previous example.


Example 6-68. Unrolled FIR Filter C Code 


void fir(short x[], short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0,x1,x2,x3,h0,h1,h2,h3;

    for (j = 0; j < 100; j+=2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < 32; i+=4){
            x1 = x[j+i+1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x2 = x[j+i+2];
            h1 = h[i+1];
            sum0 += x1 * h1;
            sum1 += x2 * h1;
            x3 = x[j+i+3];
            h2 = h[i+2];
            sum0 += x2 * h2;
            sum1 += x3 * h2;
            x0 = x[j+i+4];
            h3 = h[i+3];
            sum0 += x3 * h3;
            sum1 += x0 * h3;
        }
        y[j] = sum0 >> 15;
        y[j+1] = sum1 >> 15;
    }
}
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6.13.2 Translating C Code to Linear Assembly (Inner Loop) 


Example 6-69 shows the linear assembly for the inner loop of the FIR filter
C code (identical to Example 6-63 on page 6-120).


Example 6-69. Linear Assembly for Unrolled FIR Inner Loop 


        LDH     *x++,x1          ; x1 = x[j+i+1]
        LDH     *h++,h0          ; h0 = h[i]
        MPY     x0,h0,p00        ; x0 * h0
        MPY     x1,h0,p10        ; x1 * h0
        ADD     p00,sum0,sum0    ; sum0 += x0 * h0
        ADD     p10,sum1,sum1    ; sum1 += x1 * h0

        LDH     *x++,x2          ; x2 = x[j+i+2]
        LDH     *h++,h1          ; h1 = h[i+1]
        MPY     x1,h1,p01        ; x1 * h1
        MPY     x2,h1,p11        ; x2 * h1
        ADD     p01,sum0,sum0    ; sum0 += x1 * h1
        ADD     p11,sum1,sum1    ; sum1 += x2 * h1

        LDH     *x++,x3          ; x3 = x[j+i+3]
        LDH     *h++,h2          ; h2 = h[i+2]
        MPY     x2,h2,p02        ; x2 * h2
        MPY     x3,h2,p12        ; x3 * h2
        ADD     p02,sum0,sum0    ; sum0 += x2 * h2
        ADD     p12,sum1,sum1    ; sum1 += x3 * h2

        LDH     *x++,x0          ; x0 = x[j+i+4]
        LDH     *h++,h3          ; h3 = h[i+3]
        MPY     x3,h3,p03        ; x3 * h3
        MPY     x0,h3,p13        ; x0 * h3
        ADD     p03,sum0,sum0    ; sum0 += x3 * h3
        ADD     p13,sum1,sum1    ; sum1 += x0 * h3

[cntr]  SUB     cntr,1,cntr      ; decrement loop counter
[cntr]  B       LOOP             ; branch to loop
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6.13.3 Translating C Code to Linear Assembly (Outer Loop) 


Example 6-70 shows the instructions that execute all of the outer loop func- 
tions. All of these instructions are conditional on inner loop counters. Two 
different counters are needed, because they must decrement to 0 on different 
iterations. 


- The resetting of the x and h pointers is conditional on the pointer reset
  counter, pctr.


- The shifting and storing of the even and odd y elements are conditional on
  the store counter, sctr.


When these counters are 0, all of the instructions that are conditional on that
value execute.


- The MVK instruction resets the counters to 8 because after every eight
  iterations of the loop, a new inner loop is completed (8 x 4 elements are
  processed).


- The pointer reset counter becomes 0 first to reset the load pointers, then
  the store counter becomes 0 to shift and store the result.


Example 6—70. Linear Assembly for FIR Outer Loop 


[sctr]  SUB     sctr,1,sctr      ; dec store lp cntr
[!sctr] SHR     sum07,15,y0      ; (sum0 >> 15)
[!sctr] SHR     sum17,15,y1      ; (sum1 >> 15)
[!sctr] STH     y0,*y++[2]       ; y[j] = (sum0 >> 15)
[!sctr] STH     y1,*y_1++[2]     ; y[j+1] = (sum1 >> 15)
[!sctr] MVK     4,sctr           ; reset store lp cntr

[pctr]  SUB     pctr,1,pctr      ; dec pointer reset lp cntr
[!pctr] SUB     x,rstx2,x        ; reset x ptr
[!pctr] SUB     x_1,rstx1,x_1    ; reset x_1 ptr
[!pctr] SUB     h,rsth1,h        ; reset h ptr
[!pctr] SUB     h_1,rsth2,h_1    ; reset h_1 ptr
[!pctr] MVK     4,pctr           ; reset pointer reset lp cntr


6.13.4 Unrolled FIR Filter C Code 


The total number of instructions to execute both the inner and outer loops is 
38 (26 for the inner loop and 12 for the outer loop). A 4-cycle loop is no longer 
possible. To avoid slowing down the throughput of the inner loop to reduce the 
outer-loop overhead, you must unroll the FIR filter again. 


Example 6—71 shows the C code for the FIR filter, which operates on eight 
elements every inner loop. Two outer loops are also being processed together, 
as in Example 6-68 on page 6-133. 
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Example 6-71. Unrolled FIR Filter C Code 


void fir(short x[], short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0,x1,x2,x3,x4,x5,x6,x7,h0,h1,h2,h3,h4,h5,h6,h7;

    for (j = 0; j < 100; j+=2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < 32; i+=8){
            x1 = x[j+i+1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x2 = x[j+i+2];
            h1 = h[i+1];
            sum0 += x1 * h1;
            sum1 += x2 * h1;
            x3 = x[j+i+3];
            h2 = h[i+2];
            sum0 += x2 * h2;
            sum1 += x3 * h2;
            x4 = x[j+i+4];
            h3 = h[i+3];
            sum0 += x3 * h3;
            sum1 += x4 * h3;
            x5 = x[j+i+5];
            h4 = h[i+4];
            sum0 += x4 * h4;
            sum1 += x5 * h4;
            x6 = x[j+i+6];
            h5 = h[i+5];
            sum0 += x5 * h5;
            sum1 += x6 * h5;
            x7 = x[j+i+7];
            h6 = h[i+6];
            sum0 += x6 * h6;
            sum1 += x7 * h6;
            x0 = x[j+i+8];
            h7 = h[i+7];
            sum0 += x7 * h7;
            sum1 += x0 * h7;
        }
        y[j] = sum0 >> 15;
        y[j+1] = sum1 >> 15;
    }
}




6.13.5 Translating C Code to Linear Assembly (Inner Loop) 


Example 6-72 shows the instructions that perform the inner and outer loops
of the FIR filter. These instructions reflect the following modifications:


- LDWs are used instead of LDHs to reduce the number of loads in the loop.

- The reset pointer instructions immediately follow the LDW instructions.

- The first ADD instructions for sum0 and sum1 are conditional on the same
  value as the store counter, because when sctr is 0, the end of one inner
  loop has been reached and the first ADD, which adds the previous sum07
  to p00, must not be executed.

- The first ADD for sum0 writes to the same register as the first MPY p00.
  The second ADD reads p00 and p01. At the beginning of each inner loop,
  the first ADD is not performed, so the second ADD correctly reads the
  results of the first two MPYs (p01 and p00) and adds them together. For
  other iterations of the inner loop, the first ADD executes, and the second
  ADD sums the second MPY result (p01) with the running accumulator. The
  same is true for the first and second ADDs of sum1.
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Example 6—72. Linear Assembly for FIR With Outer Loop Conditionally Executed 


With Inner Loop 
        LDW     *h++[2],h01      ; h[i+0] & h[i+1]
        LDW     *h_1++[2],h23    ; h[i+2] & h[i+3]
        LDW     *h++[2],h45      ; h[i+4] & h[i+5]
        LDW     *h_1++[2],h67    ; h[i+6] & h[i+7]

        LDW     *x++[2],x01      ; x[j+i+0] & x[j+i+1]
        LDW     *x_1++[2],x23    ; x[j+i+2] & x[j+i+3]
        LDW     *x++[2],x45      ; x[j+i+4] & x[j+i+5]
        LDW     *x_1++[2],x67    ; x[j+i+6] & x[j+i+7]
        LDH     *x,x8            ; x[j+i+8]

[sctr]  SUB     sctr,1,sctr      ; dec store lp cntr
[!sctr] SHR     sum07,15,y0      ; (sum0 >> 15)
[!sctr] SHR     sum17,15,y1      ; (sum1 >> 15)
[!sctr] STH     y0,*y++[2]       ; y[j] = (sum0 >> 15)
[!sctr] STH     y1,*y_1++[2]     ; y[j+1] = (sum1 >> 15)

        MV      x01,x01b         ; move to other reg file
        MPYLH   h01,x01b,p10     ; p10 = h[i+0]*x[j+i+1]
[sctr]  ADD     p10,sum17,p10    ; sum1(p10) = p10 + sum1
        MPYHL   h01,x23,p11      ; p11 = h[i+1]*x[j+i+2]
        ADD     p11,p10,sum11    ; sum1 += p11
        MPYLH   h23,x23,p12      ; p12 = h[i+2]*x[j+i+3]
        ADD     p12,sum11,sum12  ; sum1 += p12
        MPYHL   h23,x45,p13      ; p13 = h[i+3]*x[j+i+4]
        ADD     p13,sum12,sum13  ; sum1 += p13
        MPYLH   h45,x45,p14      ; p14 = h[i+4]*x[j+i+5]
        ADD     p14,sum13,sum14  ; sum1 += p14
        MPYHL   h45,x67,p15      ; p15 = h[i+5]*x[j+i+6]
        ADD     p15,sum14,sum15  ; sum1 += p15
        MPYLH   h67,x67,p16      ; p16 = h[i+6]*x[j+i+7]
        ADD     p16,sum15,sum16  ; sum1 += p16
        MPYHL   h67,x8,p17       ; p17 = h[i+7]*x[j+i+8]
        ADD     p17,sum16,sum17  ; sum1 += p17

        MPY     h01,x01,p00      ; p00 = h[i+0]*x[j+i+0]
[sctr]  ADD     p00,sum07,p00    ; sum0(p00) = p00 + sum0
        MPYH    h01,x01,p01      ; p01 = h[i+1]*x[j+i+1]
        ADD     p01,p00,sum01    ; sum0 += p01
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Example 6—72. Linear Assembly for FIR With Outer Loop Conditionally Executed 
With Inner Loop (Continued) 


        MPY     h23,x23,p02      ; p02 = h[i+2]*x[j+i+2]
        ADD     p02,sum01,sum02  ; sum0 += p02
        MPYH    h23,x23,p03      ; p03 = h[i+3]*x[j+i+3]
        ADD     p03,sum02,sum03  ; sum0 += p03
        MPY     h45,x45,p04      ; p04 = h[i+4]*x[j+i+4]
        ADD     p04,sum03,sum04  ; sum0 += p04
        MPYH    h45,x45,p05      ; p05 = h[i+5]*x[j+i+5]
        ADD     p05,sum04,sum05  ; sum0 += p05
        MPY     h67,x67,p06      ; p06 = h[i+6]*x[j+i+6]
        ADD     p06,sum05,sum06  ; sum0 += p06
        MPYH    h67,x67,p07      ; p07 = h[i+7]*x[j+i+7]
        ADD     p07,sum06,sum07  ; sum0 += p07

[!sctr] MVK     4,sctr           ; reset store lp cntr

[pctr]  SUB     pctr,1,pctr      ; dec pointer reset lp cntr
[!pctr] SUB     x,rstx2,x        ; reset x ptr
[!pctr] SUB     x_1,rstx1,x_1    ; reset x_1 ptr
[!pctr] SUB     h,rsth1,h        ; reset h ptr
[!pctr] SUB     h_1,rsth2,h_1    ; reset h_1 ptr
[!pctr] MVK     4,pctr           ; reset pointer reset lp cntr

[octr]  SUB     octr,1,octr      ; dec outer lp cntr
[octr]  B       LOOP             ; branch outer loop


6.13.6 Translating C Code to Linear Assembly (Inner Loop and Outer Loop) 


Example 6-73 shows the linear assembly with functional units assigned. (As
in Example 6-64 on page 6-122, symbolic names now have an A or B in front
of them to signify the register file where they reside.) Although this allocation
is one of many possibilities, one goal is to keep the 1X and 2X paths to a
minimum. Even with this goal, you have five 2X paths and seven 1X paths.


One requirement that was assumed when the functional units were chosen
was that all the sum0 values reside on the same side (A in this case) and all
the sum1 values reside on the other side (B). Because you are scheduling
eight accumulates for both sum0 and sum1 in an 8-cycle loop, each ADD must
be scheduled immediately following the previous ADD. Therefore, it is
undesirable for any sum0 ADDs to use the same functional units as sum1 ADDs.


One MV instruction was added to get x01 on the B side for the MPYLH p10
instruction.
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Example 6—73. Linear Assembly for FIR With Outer Loop Conditionally Executed 
With Inner Loop (With Functional Units) 


        .global _fir

_fir:   .cproc  x, h, y

        .reg    x_1, h_1, y_1, octr, pctr, sctr
        .reg    sum01, sum02, sum03, sum04, sum05, sum06, sum07
        .reg    sum11, sum12, sum13, sum14, sum15, sum16, sum17
        .reg    p00, p01, p02, p03, p04, p05, p06, p07
        .reg    p10, p11, p12, p13, p14, p15, p16, p17
        .reg    x01b, x01, x23, x45, x67, x8, h01, h23, h45, h67
        .reg    y0, y1, rstx1, rstx2, rsth1, rsth2

        ADD     x,4,x_1                  ; point to x[2]
        ADD     h,4,h_1                  ; point to h[2]
        ADD     y,2,y_1                  ; point to y[1]
        MVK     60,rstx1                 ; used to rst x pointer each outer loop
        MVK     60,rstx2                 ; used to rst x pointer each outer loop
        MVK     64,rsth1                 ; used to rst h pointer each outer loop
        MVK     64,rsth2                 ; used to rst h pointer each outer loop
        MVK     201,octr                 ; loop ctr = 201 = (100/2) * (32/8) + 1
        MVK     4,pctr                   ; pointer reset lp cntr = 32/8
        MVK     5,sctr                   ; reset store lp cntr = 32/8 + 1
        ZERO    sum07                    ; sum07 = 0
        ZERO    sum17                    ; sum17 = 0

        .mptr   x, x+0
        .mptr   x_1, x+4
        .mptr   h, h+0
        .mptr   h_1, h+4

LOOP:
        LDW     .D1T1   *h++[2],h01      ; h[i+0] & h[i+1]
        LDW     .D2T2   *h_1++[2],h23    ; h[i+2] & h[i+3]
        LDW     .D1T1   *h++[2],h45      ; h[i+4] & h[i+5]
        LDW     .D2T2   *h_1++[2],h67    ; h[i+6] & h[i+7]

        LDW     .D2T1   *x++[2],x01      ; x[j+i+0] & x[j+i+1]
        LDW     .D1T2   *x_1++[2],x23    ; x[j+i+2] & x[j+i+3]
        LDW     .D2T1   *x++[2],x45      ; x[j+i+4] & x[j+i+5]
        LDW     .D1T2   *x_1++[2],x67    ; x[j+i+6] & x[j+i+7]
        LDH     .D2T1   *x,x8            ; x[j+i+8]

[sctr]  SUB     .S1     sctr,1,sctr      ; dec store lp cntr
[!sctr] SHR     .S1     sum07,15,y0      ; (sum0 >> 15)
[!sctr] SHR     .S2     sum17,15,y1      ; (sum1 >> 15)
[!sctr] STH     .D1     y0,*y++[2]       ; y[j] = (sum0 >> 15)
[!sctr] STH     .D2     y1,*y_1++[2]     ; y[j+1] = (sum1 >> 15)
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Example 6—73. Linear Assembly for FIR With Outer Loop Conditionally Executed 
With Inner Loop (With Functional Units) (Continued) 


        MV      .L2X    x01,x01b         ; move to other reg file
        MPYLH   .M2X    h01,x01b,p10     ; p10 = h[i+0]*x[j+i+1]
[sctr]  ADD     .L2     p10,sum17,p10    ; sum1(p10) = p10 + sum1
        MPYHL   .M1X    h01,x23,p11      ; p11 = h[i+1]*x[j+i+2]
        ADD     .L2X    p11,p10,sum11    ; sum1 += p11
        MPYLH   .M2     h23,x23,p12      ; p12 = h[i+2]*x[j+i+3]
        ADD     .L2     p12,sum11,sum12  ; sum1 += p12
        MPYHL   .M1X    h23,x45,p13      ; p13 = h[i+3]*x[j+i+4]
        ADD     .L2X    p13,sum12,sum13  ; sum1 += p13
        MPYLH   .M1     h45,x45,p14      ; p14 = h[i+4]*x[j+i+5]
        ADD     .L2X    p14,sum13,sum14  ; sum1 += p14
        MPYHL   .M2X    h45,x67,p15      ; p15 = h[i+5]*x[j+i+6]
        ADD     .S2     p15,sum14,sum15  ; sum1 += p15
        MPYLH   .M2     h67,x67,p16      ; p16 = h[i+6]*x[j+i+7]
        ADD     .L2     p16,sum15,sum16  ; sum1 += p16
        MPYHL   .M1X    h67,x8,p17       ; p17 = h[i+7]*x[j+i+8]
        ADD     .L2X    p17,sum16,sum17  ; sum1 += p17

        MPY     .M1     h01,x01,p00      ; p00 = h[i+0]*x[j+i+0]
[sctr]  ADD     .L1     p00,sum07,p00    ; sum0(p00) = p00 + sum0
        MPYH    .M1     h01,x01,p01      ; p01 = h[i+1]*x[j+i+1]
        ADD     .L1     p01,p00,sum01    ; sum0 += p01
        MPY     .M2     h23,x23,p02      ; p02 = h[i+2]*x[j+i+2]
        ADD     .L1X    p02,sum01,sum02  ; sum0 += p02
        MPYH    .M2     h23,x23,p03      ; p03 = h[i+3]*x[j+i+3]
        ADD     .L1X    p03,sum02,sum03  ; sum0 += p03
        MPY     .M1     h45,x45,p04      ; p04 = h[i+4]*x[j+i+4]
        ADD     .L1     p04,sum03,sum04  ; sum0 += p04
        MPYH    .M1     h45,x45,p05      ; p05 = h[i+5]*x[j+i+5]
        ADD     .L1     p05,sum04,sum05  ; sum0 += p05
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Example 6—73. Linear Assembly for FIR With Outer Loop Conditionally Executed 
With Inner Loop (With Functional Units)(Continued) 


        MPY     .M2     h67,x67,p06      ; p06 = h[i+6]*x[j+i+6]
        ADD     .L1X    p06,sum05,sum06  ; sum0 += p06
        MPYH    .M2     h67,x67,p07      ; p07 = h[i+7]*x[j+i+7]
        ADD     .L1X    p07,sum06,sum07  ; sum0 += p07

[!sctr] MVK     .S1     4,sctr           ; reset store lp cntr

[pctr]  SUB     .S1     pctr,1,pctr      ; dec pointer reset lp cntr
[!pctr] SUB     .S2     x,rstx2,x        ; reset x ptr
[!pctr] SUB     .S1     x_1,rstx1,x_1    ; reset x_1 ptr
[!pctr] SUB     .S1     h,rsth1,h        ; reset h ptr
[!pctr] SUB     .S2     h_1,rsth2,h_1    ; reset h_1 ptr
[!pctr] MVK     .S1     4,pctr           ; reset pointer reset lp cntr

[octr]  SUB     .S2     octr,1,octr      ; dec outer lp cntr
[octr]  B       .S2     LOOP             ; branch outer loop

        .endproc
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6.13.7 Determining the Minimum Iteration Interval 


Based on Table 6-27, the minimum iteration interval is 8. An iteration interval 
of 8 means that two multiply-accumulates per cycle are still executing. 


Table 6-27. Resource Table for FIR Filter Code 


(a) A side                         (b) B side

Unit(s)              Total/Unit    Unit(s)              Total/Unit
.M1                  8             .M2                  8
.S1                  7             .S2                  6
.D1                  5             .D2                  6
.L1                  8             .L2                  8
Total non-.M units   20            Total non-.M units   20
1X paths             7             2X paths             5


6.13.8 Final Assembly 


Example 6—74 shows the final assembly for the FIR filter with the outer loop 
conditionally executing in parallel with the inner loop. 


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Example 6—74. Final Assembly Code for FIR Filter 


[!Al] SUB 


[!Al1] 
[!Al1] 


FPunwW 
OaGG 
= WwW Ww 


[!Al] SUB 


[A2] SUB 


[!A2] 


Ss 

M 

M 
[A2] ADD 

L 

L 

Z 


B4,A0 
B4,4,B2 
A4,B1 
A4,4,A4 
200,BO 


*A4H 
*B1lt+ 
4,Al 


*B2+ 
*AO+4 


60,A3 
60,Bl1 


*B1l+ 
*A4H 


PLZ). ALO 


+[2],B7 
+[2],A8 


+[2],A11 
+[2],B10 


Al,1,Al1 
64,A5 
64,B5 
A6o,2,B6 


*AO+4 
*B2+ 


t[2],A9 
+[2],B8 


A4,A3,A4 


B1,B1 


14,Bl1 


AO, A5,A0 
*B1,A8 


A10,0,B8 


5,A2 


A8,B8,B4 
B2, BS; B2 
A8,B9,A14 


A8,A10,A7 
B7,B9,B13 
A2,1,A2 


Bll 


Bl11,1 


15,Bil 


B7,B9,B9 


A8,Al 
B4,B1 
*RA+4 
*B1L+4 
Al10 


10,A10 
11,B4 
+[2],B9 
+[2],A10 


, 


’ 


7* x[Jjtit2] 
7* x[Jjtit0] 
zero out initial 


’ 


point to h[0 
point to h[2 
point to x[j 


point to x[jt2 


& h[l] 
& h[3] 
& x[j+1] 


& *[j+3] 


set lp ctr ((32/8)*(100/2) ) 


x[jtit2] & x 
x[jtit0] & x 


Fit3] 
bit1] 


set pointer reset lp cntr 


h[it2 
h[it0o 


& h[it+3 
& h[itl 


used to reset x ptr (16*4-4) 
used to reset x ptr (16*4-4) 


x(jtit4] 
x[j+i+6] 


& x[j4 
& x[j4 


rit5] 
bit7] 


dec pointer reset lp cntr 
used to reset h ptr (16*4) 
used to reset h ptr (16*4) 


point to y[jt1] 


h[it4] 


reset. x ptr 


reset x ptr 
reset h ptr 
x[J+1i+8] 


move to other reg file 
set store lp cntr 


plO = h[it0]*x[j+itl 


reset h ptr 


pll = h[itl]*x[j+i+2 


poo 


& h[it5] 
h[it6] & h[it7] 


h[it0] *x[j+i+0 
pl2 = h[i+2]*x[j+i+3 


dec store lp cntr 
zero out initial accumulator 


(Bsuml >> 15) 


p02 = h[it2]*x[j+it+2] 
pOl = h[itl]*x[j+it+1] 
suml (p10) = p10 + suml 


& x[j4 
& x[j1 


tit+3] 
ti+1] 
l accumulator 
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Example 6—74. Final Assembly Code for FIR Filter (Continued) 


LOOP: 
[!A2] 
[BO] 


[A2] 


[Al] 


[BO] 


[!Al1] 


[!Al1] 
[!Al1] 


[!A2] 


[!A2] 
[!A2] 


[!Al1] 
[!Al1] 


Al10,15,A12 
BO,1,B0 
B7,B9,B13 
A7,A10,A7 
B7,A11,A10 
A14,B4,B7 
*AB2++ [2] ,-B7 
*A0++[2],A8 


A10,A7,A13 
A9,B10,B12 
A9,A11,A10 
B13,B7,B7 
*B1++[2],A11 
*A4++[2],B10 
Al,1,Al1 


LOOP 
A9,A11,A11 
B9,A13,A13 
B8,B10,B13 
A10,B7,B7 
*A0++[2],A9 
*B2++[2],B8 
A4,A3,A4 


B8,B10,B11 
A9,A11,A11 
B13,A13,A9 
A10,B7,B7 
B1,B14,Bl 
AO,A5,A0 
*B1,A8 


4,A2 
B8,B10,B13 
Al11,A9,A9 
B8,A8,A9 
BL2; BY, BLO 
B11, *B6++[2] 
Al12, *A6++[2] 
A10,0,B8 


A11,A9,A12 
B13,B10,B8 
A8,B8,B4 
4,Al 
B2,B5,B2 
A8,B9,A14 


a 


, 


, 


a 


, 


, 


, 


a 
a 


, 


;* reset x ptr 


(Asum0 >> 15) 


dec outer lp cntr 


p03 = h[it+3]*x[j+it3] 
sum0 (p00) = p00 + sum0 
pl3 = h[it+3]*x[j+it4] 
suml += pll 

* h[it2] & h[it3] 

* h[it0O] & h[itl] 


* x[jtité] 


bi+5] 


& x[j4 


bit+7] 


*x[3+i+6] 
*x[J+i+5] 


* dec pointer reset lp cntr 


Branch outer loop 


p06 = 
p05 = h[it5 
sum0 += p03 
suml += pl4 
* reset x ptr 


;* reset h ptr 
;* x[Jj+it8] 


reset store lp cntr 
p07 = h[it7]*x[j+it7] 


sum0 += p04 


p04 = h[it4]*x[j+it4] 
sum0 += p02 
pl6 = h[it6é]*x[j+it7] 
suml += p13 

;* hflit4] & h[it5] 

* h[it+6] & h[it7] 


*x[5+i+6] 
*x[5+i4+5] 


p17 = h[it+7]*x[j+it8] 


suml += p15 
y(j+1] = 
yj] = 


(Bsuml >> 15) 
(Asum0 >> 15) 


* move to other reg file 


sum0 += p05 
suml += pl6 


;* plO0 = h[itO]*x[j+i4 


+1] 


7* reset pointer reset lp cntr 


a 


, 


;* reset h ptr 


;* pll = hlitl]*x[j+i4 


+2] 



Example 6—74. Final Assembly Code for FIR Filter (Continued) 


ADD sL2X A9,B8,B11 
ADD .L1X B11,A12,A12 
MPY M1 A8,A10,A7 
MPYLH .M2 B7,B9,B13 
[A2] SUB .S1 A2,1,A2 
ADD Pitp.< B13,A12,A10 
[!A2] SHR 282 B11,15,B11 
MPY .M2 B7,B9,B9 
MPYH .M1 A8,A10,A10 
[A2] ADD «2 B4,B11,B4 
LDW Dia *A4++[2],B9 
LDW D2 *B1++[2],A10 
;Branch occurs here 
{[!A2] SHR owe A10,15,A12 
[!A2] STH »D2 Bll, *B6++[2] 
|| [!A2] STH -D1 A12, *A6++[2] 


’ 


’ 


, 


sum0 


+= pl7 

+= p06 

= h[it+0]*x[Jj+i+0] 
= h[it2]*x[j+it3] 
store lp cntr 

+= p07 


* (Bsuml >> 15) 


* p02 = 


* pol 


p 
j4i4+2] & x[5tit3] 
i 


h[it2]*x[j+it2] 
= h[itl]*x[Jj+i+1] 
10) = p10 + suml 


+0] & x[j+itl] 


(Asum0 >> 15) 


y(jt1 
y[j] 


] = (Bsuml >> 15) 
= (Asum0 >> 15) 


6.13.9 Comparing Performance 


The cycle count of this code is 1612: 50 x (8 x 4 + 0) + 12. The overhead due
to the outer loop has been completely eliminated.


Table 6-28. Comparison of FIR Filter Code

Code Example                                               Cycles                       Cycle Count
---------------------------------------------------------  ---------------------------  -----------
Example 6-57  FIR with redundant load elimination          50 x (16 x 2 + 9 + 6) + 2    2352
Example 6-65  FIR with redundant load elimination and      50 x (8 x 4 + 10 + 6) + 2    2402
              no memory hits
Example 6-67  FIR with redundant load elimination and      50 x (7 x 4 + 6 + 6) + 6     2006
              no memory hits with outer loop
              software-pipelined
Example 6-70  FIR with redundant load elimination and      50 x (8 x 4 + 0) + 12        1612
              no memory hits with outer loop
              conditionally executed with inner loop
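
The cycle counts in Table 6-28 all follow the same pattern: an outer-loop count multiplied by the sum of the inner-loop cycles and the per-iteration outer-loop overhead, plus a fixed setup term. The following C sketch reproduces that arithmetic. The function and parameter names are illustrative and do not come from the 'C6x tools.

```c
#include <assert.h>

/* Cycle-count model behind Table 6-28 (names are hypothetical):
 *   total = outer_iters * (inner_cycles + outer_overhead) + fixed_cycles
 * inner_cycles is the product shown in the table, for example 8 x 4. */
static int fir_cycle_count(int outer_iters, int inner_cycles,
                           int outer_overhead, int fixed_cycles)
{
    assert(outer_iters > 0 && inner_cycles > 0);
    assert(outer_overhead >= 0 && fixed_cycles >= 0);
    return outer_iters * (inner_cycles + outer_overhead) + fixed_cycles;
}
```

For example, fir_cycle_count(50, 8 * 4, 0, 12) gives the 1612 cycles of the final version, where the outer-loop overhead term is zero.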
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Chapter 7 


Interrupts 


This chapter describes interrupts from a software-programming point of view. 
A description of single and multiple register assignment is included, followed 
by code generation of interruptible code and finally, descriptions of interrupt 
subroutines. 


Topic
7.1 Overview of Interrupts
7.2 Single Assignment vs. Multiple Assignment
7.3 Interruptible Loops
7.4 Interruptible Code Generation
7.5 Interrupt Subroutines

7.1 Overview of Interrupts 



An interrupt is an event that stops the current process in the CPU so that the 
CPU can attend to the task needing completion because of another event. 
These events are external to the core CPU but may originate on-chip or off- 
chip. Examples of on-chip interrupt sources include timers, serial ports, DMAs 
and external memory stalls. Examples of off-chip interrupt sources include 
analog-to-digital converters, host controllers and other peripheral devices. 


Typically, DSPs compute different algorithms very quickly within an asynchro- 
nous system environment. Asynchronous systems must be able to control the 
DSP based on events outside of the DSP core. Because certain events can 
have higher priority than algorithms already executing on the DSP, it is some- 
times necessary to change, or interrupt, the task currently executing on the 
DSP. 


The ’C6x provides hardware interrupts that allow this to occur automatically. 
Once an interrupt is taken, an interrupt subroutine performs certain tasks or 
actions, as required by the event. Servicing an interrupt involves switching 
contexts while saving all state of the machine. Thus, upon return from the inter- 
rupt, operation of the interrupted algorithm is resumed as if there had been no 
interrupt. Saving state involves saving various registers upon entry to the inter- 
rupt subroutine and then restoring them to their original state upon exit. 


This chapter focuses on the software issues associated with interrupts. The 
hardware description of interrupt operation is fully described in the 
TMS320C6x CPU and Instruction Set Reference Guide. 


In order to understand the software issues of interrupts, we must talk about two 
types of code: the code that is interrupted and the interrupt subroutine, which 
performs the tasks required by the interrupt. The following sections provide in- 
formation on: 


- Single and multiple assignment of registers
- Loop interruptibility
- How to use the ’C6x code generation tools to satisfy different requirements
- Interrupt subroutines



7.2 Single Assignment vs. Multiple Assignment 


Register allocation on the ’C6x can be classified as either single assignment 
or multiple assignment. Single assignment code is interruptible; multiple as- 
signment is not interruptible. This section discusses the differences between 
each and explains why only single assignment is interruptible. 


Example 7-1 shows multiple assignment code. The term multiple assignment 
means that a particular register has been assigned with more than one value 
(in this case 2 values). On cycle 4, at the beginning of the ADD instruction, reg- 
ister A1 is assigned to two different values. One value, written by the SUB in- 
struction on cycle 1, already resides in the register. The second value is called 
an in-flight value and is assigned by the LDW instruction on cycle 2. Because 
the LDW instruction does not actually write a value into register A1 until the end 
of cycle 6, the assignment is considered in-flight. 


In-flight operations cause code to be uninterruptible due to unpredictability. 
Take, for example, the case where an interrupt is taken on cycle 3. At this point, 
all instructions which have begun execution are allowed to complete and no 
new instructions execute. So, 3 cycles after the interrupt is taken on cycle 3, 
the LDW instruction writes to A1. After the interrupt service routine has been 
processed, program execution continues on cycle 4 with the ADD instruction. 
In this case, the ADD reads register A1 and will be reading the result of the 
LDW, whereas normally the result of the SUB should be read. This
unpredictability means that, in order to ensure correct operation, multiple
assignment code should not be interrupted and is thus considered uninterruptible.


Example 7-1. Code With Multiple Assignment of A1

cycle
1      SUB   .S1   A4,A5,A1   ; writes to A1 in single cycle
2      LDW   .D1   *A0,A1     ; writes to A1 after 4 delay slots
3      NOP
4      ADD   .L1   A1,A2,A3   ; uses old A1 (result of SUB)
5-6    NOP   2
7      MPY   .M1   A1,A4,A5   ; uses new A1 (result of LDW)


Example 7-2 shows the same code with a new register allocation to produce 
single assignment code. Now the LDW assigns a value to register A6 instead 
of A1. Now, regardless of whether an interrupt is taken or not, A1 maintains 
the value written by the SUB instruction because LDW now writes to A6. Be- 
cause there are no in-flight registers that are read before an in-flight instruction 
completes, this code is interruptible. 



Example 7-2. Code Using Single Assignment 


cycle
1      SUB   .S1   A4,A5,A1   ; writes to A1 in single cycle
2      LDW   .D1   *A0,A6     ; writes to A6 after 4 delay slots
3      NOP
4      ADD   .L1   A1,A2,A3   ; uses A1 (result of SUB)
5-6    NOP   2
7      MPY   .M1   A6,A4,A5   ; uses A6 (result of LDW)


Both examples involve exactly the same schedule of instructions. The only dif- 
ference is the register allocation. The single assignment register allocation, as 
shown in Example 7-2, can result in higher register pressure (Example 7-2
uses one more register than Example 7-1).
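
The difference between the two register allocations can be checked with a small C model. This is only a sketch of the behavior described in this section, not a cycle-accurate simulator. It assumes, as the text states, that an interrupt lets every instruction already in flight complete before the ISR runs. The function name, parameters, and the two placeholder register values are hypothetical.

```c
#include <assert.h>

enum { SUB_RESULT = 111, LDW_RESULT = 222 };  /* arbitrary placeholder values */

/* Returns the value the ADD on cycle 4 reads from A1.
 *   ldw_writes_a1         1 models Example 7-1 (LDW targets A1),
 *                         0 models Example 7-2 (LDW targets A6).
 *   interrupt_after_cycle cycle after which an interrupt is taken,
 *                         or 0 for no interrupt. */
static int value_read_by_add(int ldw_writes_a1, int interrupt_after_cycle)
{
    int a1 = 0;
    int a6 = 0;
    int ldw_in_flight = 0;

    assert(interrupt_after_cycle >= 0 && interrupt_after_cycle <= 3);

    for (int cycle = 1; cycle <= 3; cycle++) {
        if (cycle == 1)
            a1 = SUB_RESULT;        /* SUB writes A1 in a single cycle */
        if (cycle == 2)
            ldw_in_flight = 1;      /* LDW issues; result lands after cycle 6 */

        if (cycle == interrupt_after_cycle && ldw_in_flight) {
            /* Interrupt taken: the in-flight load completes before
             * execution resumes with the next instruction. */
            if (ldw_writes_a1)
                a1 = LDW_RESULT;
            else
                a6 = LDW_RESULT;
            ldw_in_flight = 0;
        }
    }

    (void)a6;     /* A6 is not read by the ADD */
    return a1;    /* cycle 4: ADD reads A1 */
}
```

With the multiple assignment allocation, value_read_by_add(1, 3) returns the load result instead of the SUB result. With the single assignment allocation, the ADD reads the SUB result whether or not an interrupt occurs.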


The next section describes how to generate interruptible and non-interruptible 
code with the ’C6x code generation tools. 



7.3 Interruptible Loops 


Even if code employs single assignment, it may not be interruptible in a loop. 
Because the delay slots of all branch operations are protected from interrupts 
in hardware, all interrupts remain pending as long as the CPU has a pending 
branch. Since the branch instruction on the ’C6x has 5 delay slots, loops small- 
er than 6 cycles always have a pending branch. For this reason, all loops small- 
er than 6 cycles are uninterruptible. 


There are two options for making a loop with an iteration interval less than 6 
interruptible. 


1) Simply slow down the loop and force an iteration interval of 6 cycles. This 
is not always desirable since there will be a performance degradation. 


2) Unroll the loop until an iteration interval of 6 or greater is achieved. This 
ensures at least the same performance level and in some cases can im- 
prove performance (see section 6.8 Loop Unrolling). The disadvantage is 
that code size increases. 
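
As a rough guide to the second option, the smallest unroll factor that reaches an iteration interval of 6 can be computed directly. The sketch below assumes that unrolling by a factor k multiplies the iteration interval by k, which is an idealization; the actual schedule depends on resources and dependencies. The names are illustrative.

```c
#include <assert.h>

/* A branch has 5 delay slots, so a loop needs at least 6 cycles per
 * iteration before interrupts can be taken inside it. */
enum { MIN_INTERRUPTIBLE_II = 6 };

/* Smallest unroll factor k such that k * ii >= 6 (ceiling division).
 * Assumes the unrolled iteration interval scales linearly with k. */
static int min_unroll_factor(int ii)
{
    assert(ii > 0);
    return (MIN_INTERRUPTIBLE_II + ii - 1) / ii;
}
```

Under this assumption, a single-cycle loop must be unrolled 6 times, a 2-cycle loop 3 times, and a 4-cycle loop twice.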


The next section describes how to automatically generate these different op- 
tions with the ’C6x code generation tools. 
7.4 Interruptible Code Generation

The ’C6x code generation tools provide a large degree of flexibility for
interruptibility. Various combinations of single and multiple assignment code
can be generated automatically to provide the best tradeoff in interruptibility
and performance for each part of an application. In most cases, code
performance is not affected by interruptibility, but there are some exceptions:

- Software pipelined loops that have high register pressure can fail to
  register allocate at a given iteration interval when single assignment is
  required, but might otherwise succeed if multiple assignment were allowed.
  This can result in a larger iteration interval for single assignment
  software pipelined loops and thus lower performance. To determine if this
  is the problem for looped code, use the -mw feedback option.

- Because loops with minimum iteration intervals less than 6 are not
  interruptible, higher iteration intervals might be used, which results in
  lower performance. Unrolling the loop, however, prevents this reduction in
  performance (see section 7.3).

- Higher register pressure in single assignment can cause data spilling to
  memory in both looped code and non-looped code when there are not enough
  registers to store all temporary values. This reduces performance but
  occurs rarely and only in extreme cases.

The tools provide 3 levels of control to the user. These levels are described in
the following sections. For a full discussion of interruptible code generation,
see the TMS320C6x Optimizing C Compiler User’s Guide.


7.4.1 Level 0 — Specified Code is Guaranteed to Not Be Interrupted


The compiler does not disable interrupts. Thus, it is up to the system developer
to guarantee that no interrupts occur. This level has the advantage that the
compiler is allowed to use multiple assignment code and generate the minimum
iteration intervals for software pipelined loops.


The command line option -mi can be used for an entire module and the following
pragma can be used to force this level on a particular function:


#pragma FUNC_INTERRUPT_THRESHOLD (func, uint_max);



7.4.2 Level 1 — Specified Code Interruptible at All Times 


The compiler will not disable interrupts. Thus, the compiler will employ single 
assignment everywhere and will never produce a loop of less than 6 cycles. 
The command line option -mi1 can be used for an entire module and the
following pragma can be used to force this level on a particular function:


#pragma FUNC_INTERRUPT_THRESHOLD (func, 1); 


7.4.3 Level 2 — Specified Code Interruptible Within Threshold Cycles 


The compiler will disable interrupts around loops if the specified threshold 
number is not exceeded. In other words, the user can specify a threshold, or 
maximum interrupt delay, that allows the compiler to use multiple assignment 
in loops that do not exceed this threshold. The code outside of loops can have 
interrupts disabled and also use multiple assignment as long as the threshold 
of uninterruptible cycles is not exceeded. If the compiler cannot determine the 
loop count of a loop, then it assumes the threshold is exceeded and will gener- 
ate an interruptible loop. 


The command line option -mi (threshold) can be used for an entire module and
the following pragma can be used to specify a threshold for a particular
function:


#pragma FUNC_INTERRUPT_THRESHOLD (func, threshold); 
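
The Level 2 rule for loops can be summarized as a simple decision. The C sketch below is a model of the behavior described above, not the compiler's actual algorithm. It assumes the uninterruptible time of a loop is estimated as trip count times cycles per iteration, and it uses a hypothetical marker value for an unknown trip count.

```c
#include <assert.h>

enum loop_mode {
    LOOP_INTERRUPTIBLE,        /* single assignment, at least 6 cycles */
    LOOP_INTERRUPTS_DISABLED   /* multiple assignment allowed */
};

/* Hypothetical marker for "the loop count cannot be determined". */
#define TRIP_COUNT_UNKNOWN 0UL

static enum loop_mode choose_loop_mode(unsigned long trip_count,
                                       unsigned long cycles_per_iteration,
                                       unsigned long threshold)
{
    assert(cycles_per_iteration > 0);

    /* Unknown loop count: assume the threshold is exceeded. */
    if (trip_count == TRIP_COUNT_UNKNOWN)
        return LOOP_INTERRUPTIBLE;

    /* trip_count * cycles_per_iteration <= threshold, written as a
     * division so that the product cannot overflow. */
    if (trip_count <= threshold / cycles_per_iteration)
        return LOOP_INTERRUPTS_DISABLED;

    return LOOP_INTERRUPTIBLE;
}
```

In this model a threshold of 1 never permits interrupts to be disabled, which corresponds to Level 1.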
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7.5 Interrupt Subroutines 


The interrupt subroutine (ISR) is simply the routine, or function, that is called 
by an interrupt. The ’C6x provides hardware to automatically branch to this 
routine when an interrupt is received based on an interrupt service table. (See 
the Interrupt Service Table in the TMS320C6x CPU and Instruction Set Refer- 
ence Guide.) Once the branch is complete, execution begins at the first exe- 
cute packet of the ISR. 


Certain state must be saved upon entry to an ISR in order to ensure program 
accuracy upon return from the interrupt. For this reason, all registers that are 
used by the ISR must be saved to memory, preferably a stack pointed to by 
a general purpose register acting as a stack pointer. Then, upon return, all val- 
ues must be restored. This is all handled automatically by the C compiler, but 
must be done manually when writing hand-coded assembly. 


7.5.1 ISR with the C Compiler 



The C compiler automatically generates ISRs with the keyword interrupt. The 
interrupt function must be declared with no arguments and should return void. 
For example: 


interrupt void int_handler()
{
    unsigned int flags;
}


Alternatively, you can use the interrupt pragma to define a function to be an 
ISR: 


#pragma INTERRUPT (func);


The result in either case is that the C compiler automatically creates a
function that obeys all the requirements for an ISR. These are different from
the calling convention of a normal C function in the following ways:


- All general purpose registers used by the subroutine must be saved to the
  stack. If another function is called from the ISR, then all the registers
  (A0-A15, B0-B15) are saved to the stack.

- A B IRP instruction is used to return from the interrupt subroutine instead
  of the B B3 instruction used for standard C functions.

- The function cannot return a value and thus must be declared void.


See the section on Register Conventions in the TMS320C6x Optimizing C 
Compiler User’s Guide for more information on standard function calling con- 
ventions. 


7.5.2 ISR with Hand-Coded Assembly 


When writing an ISR by hand, it is necessary to handle the same tasks the C
compiler does. So, the following steps must be taken:

- All registers used must be saved to the stack before modification. For this
  reason, it is preferable to maintain one general purpose register to be used
  as a stack pointer in your application. (The C compiler uses B15.)

- If another C routine is called from the ISR (with an assembly branch
  instruction to the _c_func_name label), then all registers must be saved to
  the stack on entry.

- A B IRP instruction must be used to return from the routine. If this is the
  NMI ISR, a B NRP must be used instead.

- An NOP 4 is required after the last LDW in this case to ensure that B0 is
  restored before returning from the interrupt.


Example 7-3. Hand-Coded Assembly ISR

* Assume Register B0-B4 & A0 are the only registers used by the
* ISR and no other functions are called
        STW     B0,*B15--       ; store B0 to stack
        STW     A0,*B15--       ; store A0 to stack
        STW     B1,*B15--       ; store B1 to stack
        STW     B2,*B15--       ; store B2 to stack
        STW     B3,*B15--       ; store B3 to stack
        STW     B4,*B15--       ; store B4 to stack
* Beginning of ISR code
* End of ISR code
        LDW     *++B15,B4       ; restore B4
        LDW     *++B15,B3       ; restore B3
        LDW     *++B15,B2       ; restore B2
        LDW     *++B15,B1       ; restore B1
        LDW     *++B15,A0       ; restore A0
||      B       IRP             ; return from interrupt
        LDW     *++B15,B0       ; restore B0
        NOP     4               ; allow all multi-cycle instructions
                                ; to complete before branch is taken

7.5.3 Nested Interrupts 


Sometimes it is desirable to allow higher priority interrupts to interrupt lower
priority ISRs. To allow nested interrupts to occur, you must first save the IRP,
IER, and CSR to a register that is not being used or to some other memory
location (usually the stack). Once these have been saved, you can reenable
the appropriate interrupts. This involves setting the GIE bit in the CSR and
then making any necessary modifications to the IER, so that only certain
interrupts are allowed to interrupt the particular ISR. On return from the ISR,
the original values of the IRP, IER, and CSR must be restored.


Example 7-4. Hand-Coded Assembly ISR Allowing Nesting of Interrupts


* Assume Register B0-B4 & A0 are the only registers used by the
* ISR and no other functions are called
        STW     B0,*B15--       ; store B0 to stack
||      MVC     IRP,B0          ; save IRP
        STW     A0,*B15--       ; store A0 to stack
||      MVC     IER,B1          ; save IER
||      MVK     mask,A0         ; setup a new IER (if desirable)
        STW     B1,*B15--       ; store B1 to stack
||      MVC     A0,IER          ; setup a new IER (if desirable)
        STW     B2,*B15--       ; store B2 to stack
||      MVC     CSR,A0          ; read current CSR
        STW     B3,*B15--       ; store B3 to stack
||      OR      1,A0,A0         ; set GIE bit field in CSR
        STW     B4,*B15--       ; store B4 to stack
||      MVC     A0,CSR          ; write new CSR with GIE enabled
        STW     B0,*B15--       ; store B0 to stack (contains IRP)
        STW     B1,*B15--       ; store B1 to stack (contains IER)
        STW     A0,*B15--       ; store A0 to stack (original CSR)
* Beginning of ISR code
* End of ISR code
        LDW     *++B15,A0       ; restore A0 (original CSR)
        LDW     *++B15,B1       ; restore B1 (contains IER)
        LDW     *++B15,B0       ; restore B0 (contains IRP)
        LDW     *++B15,B4       ; restore B4
        LDW     *++B15,B3       ; restore B3
||      MVC     A0,CSR          ; restore original CSR
        LDW     *++B15,B2       ; restore B2
||      MVC     B0,IRP          ; restore original IRP
        LDW     *++B15,B1       ; restore B1
||      MVC     B1,IER          ; restore original IER
        LDW     *++B15,A0       ; restore A0
||      B       IRP             ; return from interrupt
        LDW     *++B15,B0       ; restore B0
        NOP     4               ; allow all multi-cycle instructions
                                ; to complete before branch is taken
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Applications Programming 


This appendix provides extensive code examples from the Global System for
Mobile Communications (GSM) enhanced full-rate (EFR) vocoder. The assembly
code examples in this appendix represent hand-optimized code; the code
produced by the assembly optimizer will vary, depending on the version used. 
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A.1 Summary of Major Programming Methods 



The key to implementing applications on the ’C6x is to take advantage of the 
processor’s full speed. The main technique for achieving this goal involves un- 
rolling software loops to reach the limits of the functional units while meeting the 
data dependency constraints. 


In addition to loop unrolling, the following methods are helpful for improving 
performance: 


- Rearranging the C code

  If you are implementing a system based on existing C code, rearranging
  the tasks in the C code is a useful method to gain better performance.

- Avoiding memory bank hits

  Memory bank hits, especially those in the inner loop of a nested loop
  application, hurt performance dramatically and must be avoided. Most
  memory bank hits, however, can be eliminated by allocating the relevant
  arrays properly. Some situations, like accessing a word and a halfword in
  the same cycle, can also create the chance of a memory bank hit and
  should also be avoided.


If the system implementation is quite complicated, the program-memory size 
becomes an issue. To achieve a good balance between program-memory size 
and speed, you can implement the less critical portions with highly-compact 
assembly code that sacrifices performance. 



A.2 Implementation of the GSM EFR Vocoder 


This section presents the implementation of some representative pieces of code
for the Global System for Mobile Communications (GSM) enhanced full-rate
(EFR) vocoder. These include the:

- Multiply-accumulate loop
- Windowing and scaling part of autocorr.c
- cor_h
- rrv computation in search_10i40
- Index search in search_10i40
- FIR filter (residu.c)
- Lag search in the lag_max() routine

Note:

European Telecommunications Standards Institute (ETSI) has the copyright
to all the C code used in this section.


The following global constants/symbols are defined in the EFR vocoder: 


#define Word16 short
#define Word32 int
#define MAX_32 0x7fffffffL
#define MIN_32 0x80000000L
#define MAX_16 0x7fff
#define MIN_16 0x8000
A.2.1 Implementation of the Multiply-Accumulate Loop


First, examine the most popular loop used in almost every fixed-point vocoder, 
the multiply-accumulate (MAC) loop, shown in Example A-1. 


Example A-1. C Code for the Typical MAC Loop 


input:
    Word16 N;        (typical value of N is an even integer,
                      greater than or equal to 20)
    Word16 *x, *y;
result:
    Word32 sum;
C Code
    sum=0;
    for (i=0;i<N;i++) sum=L_mac(sum,x[i],y[i]);
where L_mac(a,b,c) = _sadd(a,_smpy(b,c))
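
For experimenting with this loop on a host machine, the two intrinsics can be modeled in portable C. The sketch below is an assumption-labeled emulation, not the compiler's implementation: sat_add32 and sat_mpy16 are hypothetical stand-ins for _sadd and _smpy, with _smpy modeled as a 16 x 16 multiply, a left shift by one, and saturation to 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for _sadd: 32-bit addition that saturates instead of wrapping. */
static int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* Stand-in for _smpy: (a * b) << 1 with saturation. The only input pair
 * that saturates is a = b = -32768. */
static int32_t sat_mpy16(int16_t a, int16_t b)
{
    int64_t p = (int64_t)a * (int64_t)b * 2;
    return (p > INT32_MAX) ? INT32_MAX : (int32_t)p;
}

/* L_mac(a,b,c) = _sadd(a, _smpy(b,c)) */
static int32_t l_mac(int32_t sum, int16_t x, int16_t y)
{
    return sat_add32(sum, sat_mpy16(x, y));
}

/* The MAC loop of Example A-1. */
static int32_t mac_loop(const int16_t *x, const int16_t *y, int n)
{
    int32_t sum = 0;
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        sum = l_mac(sum, x[i], y[i]);
    return sum;
}
```

Because every product is doubled by the left shift, x = {1, 2, 3} and y = {4, 5, 6} accumulate to 64 rather than 32.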


Example A-2 shows a list of symbolic instructions for each iteration of the loop. 


Example A-2. Linear Assembly for the MAC Loop 


LOOP:
        LDH     .D      *xptr++,xi      ; load x[i]
        LDH     .D      *yptr++,yi      ; load y[i]
        SMPY    .M      xi,yi,tmp       ; smpy(x[i],y[i])
        SADD    .L      sum,tmp,sum     ; sum=sadd(sum,smpy(x[i],y[i]))
[cntr]  SUB     .ALU    cntr,1,cntr     ; decrement the loop counter
[cntr]  B       .S      LOOP            ; branch to the loop
In Example A-2, xptr is the pointer for the x array and yptr is the pointer for the
y array. Because there are eight functional units, these instructions can easily
fit into one execution packet.


In general, unrolling the loop once as in the code in Example A—3 does not give 
the same result as the code shown in Example A—1, because of the ordering 
dependence of the saturated addition. 
Example A-3. C Code for MAC Loop With Loop Unrolling

    Word32 sum_e, sum_o;
    sum_e=0;
    sum_o=0;
    for (i=0;i<N;i+=2) {
        sum_e=L_mac(sum_e,x[i],y[i]);
        sum_o=L_mac(sum_o,x[i+1],y[i+1]);
    }
    sum=L_add(sum_o,sum_e);
where L_add(a,b)=_sadd(a,b)


However, both approaches lead to the same result if x[i] = y[i] for every i,
because _smpy(x[i], x[i]) is always greater than or equal to 0. This special MAC
loop is used to compute the energy of a particular signal segment. In this case,
take the approach shown in Example A-3, because it doubles the performance
of the code shown in Example A-2. Example A-4 shows the C code for this
special MAC loop. Example A-5 lists the symbolic instructions for this loop.
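
The ordering dependence can be demonstrated on a host machine with a portable model of the saturating operations. In the sketch below, sat_add32 and sat_mpy16 are hypothetical stand-ins for _sadd and _smpy, and the function names are illustrative. With mixed-sign products the sequential and split accumulations can differ, because an intermediate sum saturates in one ordering but not in the other. With squared inputs all products are non-negative, and both orderings agree.

```c
#include <assert.h>
#include <stdint.h>

static int32_t sat_add32(int32_t a, int32_t b)   /* models _sadd */
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

static int32_t sat_mpy16(int16_t a, int16_t b)   /* models _smpy */
{
    int64_t p = (int64_t)a * (int64_t)b * 2;
    return (p > INT32_MAX) ? INT32_MAX : (int32_t)p;
}

/* Accumulation in original order, as in Example A-1. */
static int32_t mac_sequential(const int16_t *x, const int16_t *y, int n)
{
    int32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum = sat_add32(sum, sat_mpy16(x[i], y[i]));
    return sum;
}

/* Even/odd split accumulation, as in Example A-3 (n must be even). */
static int32_t mac_split(const int16_t *x, const int16_t *y, int n)
{
    int32_t sum_e = 0, sum_o = 0;
    assert(n % 2 == 0);
    for (int i = 0; i < n; i += 2) {
        sum_e = sat_add32(sum_e, sat_mpy16(x[i], y[i]));
        sum_o = sat_add32(sum_o, sat_mpy16(x[i + 1], y[i + 1]));
    }
    return sat_add32(sum_o, sum_e);
}
```

For x = {32767, 32767, 32767, 32767} and y = {32767, 32767, -32767, -32767}, the sequential loop saturates after the second product and ends at -2147221509, while the split loop ends at 0. Calling both functions with y = x gives the same saturated result.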


Example A—4. C Code for Energy Computation MAC Loop 


    sum=0;
    for (i=0; i<N; i++)
        sum = L_mac(sum,x[i],x[i]);
or
    sum_e=0;
    sum_o=0;
    for (i=0;i<N;i+=2) {
        sum_e=L_mac(sum_e,x[i],x[i]);
        sum_o=L_mac(sum_o,x[i+1],x[i+1]);
    }
    sum=L_add(sum_o,sum_e);


Example A-5. Linear Assembly for Energy Computation MAC Loop 


LOOP:
        LDH     .D      *xptre++,xi             ; load x[i]
        SMPY    .M      xi,xi,tmp_e             ; smpy(x[i],x[i])
        SADD    .L      sum_e,tmp_e,sum_e       ; sum_e=sadd(sum_e,smpy(x[i],x[i]))
        LDH     .D      *xptro++,xi+1           ; load x[i+1]
        SMPY    .M      xi+1,xi+1,tmp_o         ; smpy(x[i+1],x[i+1])
        SADD    .L      sum_o,tmp_o,sum_o       ; sum_o=sadd(sum_o,smpy(x[i+1],x[i+1]))
[cntr]  SUB     .S      cntr,2,cntr             ; decrement the loop counter
[cntr]  B       .S      LOOP                    ; branch to the loop

        SADD    .L      sum_e,sum_o,sum         ; sum=sadd(sum_o,sum_e)
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In Example A-5, xptre and xptro are the pointers for the x array and, at the be- 
ginning, point to x[0] and x[1], respectively. The eight instructions in the loop fit 
perfectly into one execution packet. This approach computes two MACs in one 
cycle. It doubles the performance of the code shown in Example A-2 for the 
general MAC loop. 


The final assembly code is shown in Example A-6. 


Example A-6. Assembly Code for the Energy Computation MAC Loop 


*********************************************************************
**  Texas Instruments, Inc                                         **
**                                                                 **
**  MAC Loop -- Energy Computation                                 **
**                                                                 **
**  Compute two samples a time                                     **
**                                                                 **
**  Total cycles = (N/2+2)                                         **
**                                                                 **
**  Register Usage:   A    B                                       **
**                    4    5                                       **
**                                                                 **
**  Notice that x[0] and x[1] will not be available till LOOP      **
**  is executed once. Therefore, sum_e and sum_o should be 0s      **
**  for the first three iterations. This is why A5, B5, A6,        **
**  and B6 should be set to 0s in the prolog.                      **
*********************************************************************


; A4 -—- &x[0] 
y Ba == N 
; A6 —-- sum 
ADD .L2X A4,2,B4 ; &x[1] 
SUB aD yA B4,6,Bl1 ; loop counter 
B #82 LOOP 7 branch to the loop 
MVK eal 0,A6 ; initialize sum_e 
LDH .D1 *A4++[2],A5 ; load x[0] 
LDH .D2 *B4++[2],B5 ; load x[1] 
B 282 LOOP ; branch to the loop 
MV .L2X A6,B6 ; initialize sum_o 
LDH .D1 *A4++[2],A5 ; load x[2] 
LDH .D2 *B4++[2],B5 ; load x[3] 
B 28. LOOP 7 branch to the loop 
MV .-L1 A6,A5 ; take care the initial three iterations 
MV .L2 B6,B5 ; take care the initial three iterations 
LDH .D1 *A4++[2],A5 ; load x[4] 
LDH ~D2 *B4++[2],B5 ; load x[5] 
B vod LOOP 
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Example A-6. Assembly Code for the Energy Computation MAC Loop (Continued) 

LDH ry Ob *A4++[2],A5 ; load x[6] 

I | LDH -D2 *B4++[2],B5 : load x[7] 

LOOP:
        SMPY    .M1     A5,A5,A7        ; smpy(x[i],x[i])
||      SMPY    .M2     B5,B5,B7        ; smpy(x[i+1],x[i+1])
||      SADD    .L1     A7,A6,A6        ; sum_e=sadd(sum_e,smpy(x[i],x[i]))
||      SADD    .L2     B7,B6,B6        ; sum_o=sadd(sum_o,smpy(x[i+1],x[i+1]))
||      LDH     .D1     *A4++[2],A5     ; load x[i]
||      LDH     .D2     *B4++[2],B5     ; load x[i+1]
|| [B1] B       .S1     LOOP            ; branch to the loop
|| [B1] SUB     .S2     B1,2,B1         ; decrement loop counter
        SADD    .L1     A6,B6,A6        ; final result, sum = sum_e + sum_o


A.2.2. Implementation of the Windowing and Scaling Part of autocorr.c 


The autocorr.c routine is one of the most computationally intensive modules 
in the EFR vocoder. The part used in Example A-7 is used for windowing 
speech samples and for scaling down the windowed sample sequence if the 
input level is too high. Figure A—1 shows the flow diagram for this code. 
Example A-7. C Code for the Windowing and Scaling Part of autocorr.c

#define L_WINDOW 240

input:
    Word16 x[L_WINDOW], wind[L_WINDOW];

local variables/arrays:
    Word16 i;
    Word16 y[L_WINDOW];
    Word32 sum;
    Word16 overfl, overfl_shft;

Original C code:

    /* Windowing of signal */
    for (i = 0; i < L_WINDOW; i++)
    {
        y[i] = mult_r (x[i], wind[i]);
    }

    /* Compute r[0] and test for overflow */
    overfl_shft = 0;

    do
    {
        for (i = 0; i < L_WINDOW; i++)
            sum = L_mac (sum, y[i], y[i]);

        /* If overflow divide y[] by 4 */
        if (L_sub (sum, MAX_32) == 0L)
        {
            overfl_shft = add (overfl_shft, 4);
            overfl = 1;   /* Set the overflow flag */

            for (i = 0; i < L_WINDOW; i++)
                y[i] = shr (y[i], 2);
        }
    }
    while (overfl != 0);

Where mult_r(a,b)  = _sadd(_smpy(a,b),0x8000L)>>16
      L_mac(a,b,c) = _sadd(a,_smpy(b,c))
      L_sub(a,b)   = _ssub(a,b)
      add(a,b)     = ((_sadd((a)<<16,((b)<<16)))>>16)
      shr(a,b)     = ((b)<0 ? (_sshl((a),(-b+16))>>16) : ((a)>>(b)))
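
The mult_r definition above (saturating multiply, add 0x8000 to round, keep the upper 16 bits) can be tried on a host machine with the following portable sketch. It is an emulation under stated assumptions: sat_add32 and sat_mpy16 are hypothetical stand-ins for _sadd and _smpy, and asr32 is a helper that performs an arithmetic right shift without relying on how the host compiler shifts negative values.

```c
#include <assert.h>
#include <stdint.h>

static int32_t sat_add32(int32_t a, int32_t b)   /* models _sadd */
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

static int32_t sat_mpy16(int16_t a, int16_t b)   /* models _smpy */
{
    int64_t p = (int64_t)a * (int64_t)b * 2;
    return (p > INT32_MAX) ? INT32_MAX : (int32_t)p;
}

/* Arithmetic shift right (rounds toward minus infinity). */
static int32_t asr32(int32_t v, int n)
{
    assert(n >= 0 && n < 32);
    return (v < 0) ? ~((~v) >> n) : (v >> n);
}

/* mult_r(a,b) = _sadd(_smpy(a,b),0x8000L)>>16 */
static int16_t mult_r(int16_t a, int16_t b)
{
    return (int16_t)asr32(sat_add32(sat_mpy16(a, b), 0x8000), 16);
}

/* Loop 1 of Example A-7: y[i] = mult_r(x[i], wind[i]). */
static void window_signal(const int16_t *x, const int16_t *wind,
                          int16_t *y, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = mult_r(x[i], wind[i]);
}
```

In Q15 terms, 16384 represents 0.5, so mult_r(16384, 16384) yields 8192 (0.25), and mult_r(32767, 32767) yields 32766.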
Figure A-1. Flow Diagram for the Windowing and Scaling Part of autocorr.c

Loop 1:    for (i = 0; i < L_WINDOW; i++)
               y[i] = mult_r (x[i], wind[i])

Loop 2:    for (i = 0; i < L_WINDOW; i++)
               sum = L_mac (sum, y[i], y[i])

Decision:  L_sub (sum, MAX_32) == 0L ?

Loop 3:    (Yes branch)
           for (i = 0; i < L_WINDOW; i++)
               y[i] = shr (y[i], 2)


A.2.2.1 Unrolling the Loop

Try the loop unrolling technique for each loop.


Example A-8 shows the list of symbolic instructions needed to execute one
iteration of loop 1. You can use any arithmetic logic unit (ALU) for the
loop-counter update.


Example A-—8. Linear Assembly for One Iteration of autocorr.c (Loop 1) 


LOOP1:
        LDH     .D      *windptr++,windi         ; load wind[i]
        LDH     .D      *xptr++,xi               ; load x[i]
        SMPY    .M      windi,xi,windxi0         ; smpy(x[i],wind[i])
        SADD    .L      windxi0,0x8000L,windxi1  ; sadd(smpy(x[i],wind[i]),0x8000L)
        SHR     .S      windxi1,16,yi            ; sadd(smpy(x[i],wind[i]),0x8000L)>>16
        STH     .D      yi,*yptr++               ; store y[i]
[cntr]  SUB     .ALU    cntr,1,cntr              ; decrement loop counter
[cntr]  B       .S      LOOP1                    ; branch to loop
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In Example A-8, windptr, xptr, and yptr are the pointers of wind, x, and y. 


The .D unit is used most often (three times). With properly partitioned resources, 
this is a 2-cycle loop. 


If you unroll the loop once and load both x and wind in words (in GSM EFR, 
both x and wind can be loaded in words if they are map-aligned with the word 
boundary), you can compute two y values with two cycles. The following is the 
new list of the instructions in one loop iteration. 
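
The effect of loading two 16-bit samples with one word load can be modeled in portable C. The sketch below makes two assumptions that are not stated in this section: that x[i] ends up in the lower 16 bits of the loaded word and x[i+1] in the upper 16 bits (little-endian layout), and that SMPY multiplies the lower halves of its operands while SMPYH multiplies the upper halves. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 16-bit samples into one 32-bit word: lo in bits 15..0. */
static uint32_t pack16(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Sign-extend a 16-bit field without implementation-defined casts. */
static int16_t sign16(uint32_t field)
{
    int32_t v = (int32_t)(field & 0xFFFFu);
    return (int16_t)((v ^ 0x8000) - 0x8000);
}

static int16_t low_half(uint32_t w)  { return sign16(w); }
static int16_t high_half(uint32_t w) { return sign16(w >> 16); }

/* Saturating (a * b) << 1, standing in for the multiplier. */
static int32_t sat_mpy16(int16_t a, int16_t b)
{
    int64_t p = (int64_t)a * (int64_t)b * 2;
    return (p > INT32_MAX) ? INT32_MAX : (int32_t)p;
}

/* Model of SMPY: product of the lower halves of two packed words. */
static int32_t smpy_low(uint32_t w1, uint32_t w2)
{
    return sat_mpy16(low_half(w1), low_half(w2));
}

/* Model of SMPYH: product of the upper halves of two packed words. */
static int32_t smpy_high(uint32_t w1, uint32_t w2)
{
    return sat_mpy16(high_half(w1), high_half(w2));
}
```

With one packed word holding wind[i] and wind[i+1] and another holding x[i] and x[i+1], smpy_low and smpy_high produce the two products needed for y[i] and y[i+1] from a single pair of loads.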


Example A-9. Linear Assembly for Loop 1 of autocorr.c (Using LDW) 


LOOP1:
        LDW     .D      *windptr++,windi_windi+1            ; load wind[i] and wind[i+1]
        LDW     .D      *xptr++,xi_xi+1                     ; load x[i] and x[i+1]
        SMPY    .M      windi_windi+1,xi_xi+1,windxi0       ; smpy(x[i],wind[i])
        SMPYH   .M      windi_windi+1,xi_xi+1,windxi0+1     ; smpy(x[i+1],wind[i+1])
        SADD    .L      windxi0,0x8000L,windxi1             ; sadd(smpy(x[i],wind[i]),0x8000L)
        SADD    .L      windxi0+1,0x8000L,windxi1+1         ; sadd(smpy(x[i+1],wind[i+1]),0x8000L)
        SHR     .S      windxi1,16,yi                       ; sadd(smpy(x[i],wind[i]),0x8000L)>>16
        SHR     .S      windxi1+1,16,yi+1                   ; sadd(smpy(x[i+1],wind[i+1]),0x8000L)>>16
        STH     .D      yi,*yptre++[2]                      ; store y[i]
        STH     .D      yi+1,*yptro++[2]                    ; store y[i+1]
[cntr]  SUB     .ALU    cntr,2,cntr                         ; decrement loop counter
[cntr]  B       .S      LOOP1                               ; branch to loop


In Example A-9, yptre and yptro are the pointers for the y array and, at the
beginning, point to y[0] and y[1], respectively.


Note:

Loop 2 is a special MAC loop, as described in section A.2.1. It can be
implemented either as shown in Example A-10 without loop unrolling
or as in Example A-11 with loop unrolling for one iteration.


Example A-10. Linear Assembly for Loop 2 of autocorr.c (No Loop Unrolling)

LOOP2:
        LDH   .D  *yptr++,yi     ;load y[i]
        SMPY  .M  yi,yi,yyi      ;smpy(y[i],y[i])
        SADD  .L  sum,yyi,sum    ;sadd(sum,smpy(y[i],y[i]))
[cntr]  SUB   .S  cntr,1,cntr    ;decrement loop counter
[cntr]  B     .S  LOOP2          ;branch to loop


Example A-11. Linear Assembly for Loop 2 of autocorr.c (With Loop Unrolling)

LOOP2:
        LDH   .D  *yptre++[2],yi       ;load y[i]
        LDH   .D  *yptro++[2],yi+1     ;load y[i+1]
        SMPY  .M  yi,yi,yyi            ;smpy(y[i],y[i])
        SMPY  .M  yi+1,yi+1,yyi+1      ;smpy(y[i+1],y[i+1])
        SADD  .L  sum_e,yyi,sum_e      ;sadd(sum_e,smpy(y[i],y[i]))
        SADD  .L  sum_o,yyi+1,sum_o    ;sadd(sum_o,smpy(y[i+1],y[i+1]))
[cntr]  SUB   .S  cntr,2,cntr          ;decrement loop counter
[cntr]  B     .S  LOOP2                ;branch to loop

        SADD  .L  sum_e,sum_o,sum      ;sum=sum_o+sum_e


Later, you will see that both approaches are used in this application. 
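
The two forms of loop 2 can be compared in C. This is a minimal sketch under
stated assumptions: sat_add32 and smpy16 are hypothetical models of SADD and
SMPY, and energy_single and energy_split are illustrative names that do not
appear in the EFR source. Because every product smpy(y[i],y[i]) is
nonnegative, both forms return the true sum clamped to MAX_32, so splitting
the accumulator does not change the result.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of SADD: 32-bit saturating add. */
static int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* Hypothetical model of SMPY on two halfwords: (a*b)<<1, saturated. */
static int32_t smpy16(int16_t a, int16_t b)
{
    int64_t p = 2 * (int64_t)a * (int64_t)b;
    return (p > INT32_MAX) ? INT32_MAX : (int32_t)p;
}

/* Example A-10 style: one accumulator, one y value per iteration. */
static int32_t energy_single(const int16_t *y, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum = sat_add32(sum, smpy16(y[i], y[i]));
    return sum;
}

/* Example A-11 style: even and odd accumulators; n is assumed to be even. */
static int32_t energy_split(const int16_t *y, size_t n)
{
    int32_t sum_e = 0, sum_o = 0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        sum_e = sat_add32(sum_e, smpy16(y[i], y[i]));
        sum_o = sat_add32(sum_o, smpy16(y[i + 1], y[i + 1]));
    }
    return sat_add32(sum_e, sum_o);   /* final SADD after the loop */
}
```

The split form removes the serial dependence between consecutive SADD
instructions, which is what lets two y values be accumulated per iteration.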


Loop 3 is a single-cycle loop and you cannot speed it up by simply unrolling 
the loop. The instructions for each iteration are shown in Example A-12. 


Example A-12. Linear Assembly for Loop 3 of autocorr.c

LOOP3:
        LDH   .D  *yptrl++,yi     ;load y[i]
        SHR   .S  yi,2,yi0        ;shr(y[i],2)
        STH   .D  yi0,*yptrs++    ;store y[i]=shr(y[i],2)
[cntr]  SUB   .L  cntr,1,cntr     ;decrement loop counter
[cntr]  B     .S  LOOP3           ;branch to loop


In Example A-12, yptrl is the pointer for loading the y array and yptrs is the
pointer for storing the y array.


The new flow diagram is shown in Figure A-2. 
Figure A-2. Flow Diagram for autocorr.c With Loop Unrolling

Loop 1:    for (i = 0; i < L_WINDOW; i += 2) {
               y[i]   = mult_r(x[i], wind[i])
               y[i+1] = mult_r(x[i+1], wind[i+1])
           }

Loop 2:    for (i = 0; i < L_WINDOW; i += 2) {
               sum_o = L_mac(sum_o, y[i], y[i])
               sum_e = L_mac(sum_e, y[i+1], y[i+1])
           }
           sum = sum_o + sum_e

Test:      L_sub(sum, MAX_32) == 0L ?

Loop 3:    for (i = 0; i < L_WINDOW; i++)
               y[i] = shr(y[i], 2)



A.2.2.2 Rearrange the C Code

The first execution of loop 2 can be combined with loop 1 to form a new loop I,
and its subsequent executions can be combined with loop 3 to form a new
loop II.


Another small change is the implementation of if (L_sub(sum, MAX_32) == 0L)
as if (sum == MAX_32).
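
The two forms of the test are interchangeable because a saturating subtract
returns zero only when its operands are equal. The sketch below illustrates
this; the L_sub shown here is a hypothetical C model of the basic operator
(a 32-bit saturating subtract), and overflow_test_original and
overflow_test_direct are illustrative names.

```c
#include <stdint.h>

#define MAX_32 INT32_MAX

/* Hypothetical model of L_sub: 32-bit saturating subtract. */
static int32_t L_sub(int32_t a, int32_t b)
{
    int64_t d = (int64_t)a - (int64_t)b;
    if (d > INT32_MAX) return INT32_MAX;
    if (d < INT32_MIN) return INT32_MIN;
    return (int32_t)d;
}

/* Test as written in the original C code. */
static int overflow_test_original(int32_t sum)
{
    return L_sub(sum, MAX_32) == 0L;
}

/* Test as rearranged: a single compare, no subtract. */
static int overflow_test_direct(int32_t sum)
{
    return sum == MAX_32;
}
```

The direct compare saves the subtract instruction in the test that precedes
loop II.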


The new flow diagram with rearranged C code is shown in Figure A-3. 
Figure A-3. Flow Diagram for autocorr.c With Rearranged C Code

Loop I:    for (i = 0; i < L_WINDOW; i += 2) {
               y[i]   = mult_r(x[i], wind[i])
               y[i+1] = mult_r(x[i+1], wind[i+1])
               sum_o  = L_mac(sum_o, y[i], y[i])
               sum_e  = L_mac(sum_e, y[i+1], y[i+1])
           }
           sum = sum_o + sum_e

Loop II:   for (i = 0; i < L_WINDOW; i++) {
               y[i] = shr(y[i], 2)
               sum  = L_mac(sum, y[i], y[i])
           }


You can implement loop I using either of the two approaches shown in
Example A-13.




Implementation of the GSM EFR Vocoder 


Example A-13. Linear Assembly for Loop I of autocorr.c (Modified)

LOOPI:
        LDW   .D  *windptr++,windi_windi+1         ;load wind[i] and wind[i+1]
        LDW   .D  *xptr++,xi_xi+1                  ;load x[i] and x[i+1]
        SMPY  .M  windi_windi+1,xi_xi+1,windxi0    ;smpy(x[i],wind[i])
        SMPYH .M  windi_windi+1,xi_xi+1,windxi0+1  ;smpy(x[i+1],wind[i+1])
        SADD  .L  windxi0,0x8000L,windxi1          ;sadd(smpy(x[i],wind[i]),0x8000L)
        SADD  .L  windxi0+1,0x8000L,windxi1+1      ;sadd(smpy(x[i+1],wind[i+1]),0x8000L)
        SHR   .S  windxi1,16,yi                    ;sadd(smpy(x[i],wind[i]),0x8000L)>>16
        SHR   .S  windxi1+1,16,yi+1                ;sadd(smpy(x[i+1],wind[i+1]),0x8000L)>>16
        SMPY  .M  yi,yi,yyi                        ;smpy(y[i],y[i])
        SMPY  .M  yi+1,yi+1,yyi+1                  ;smpy(y[i+1],y[i+1])
        SADD  .L  sum_e,yyi,sum_e                  ;sum_e=sadd(sum_e,smpy(y[i],y[i]))
        SADD  .L  sum_o,yyi+1,sum_o                ;sum_o=sadd(sum_o,smpy(y[i+1],y[i+1]))
        STH   .D  yi,*yptre++[2]                   ;store y[i]
        STH   .D  yi+1,*yptro++[2]                 ;store y[i+1]
[cntr]  SUB   .S  cntr,2,cntr                      ;decrement loop counter
[cntr]  B     .S  LOOPI                            ;branch to loop

or as

LOOPI:
        LDW   .D  *windptr++,windi_windi+1         ;load wind[i] and wind[i+1]
        LDW   .D  *xptr++,xi_xi+1                  ;load x[i] and x[i+1]
        SMPY  .M  windi_windi+1,xi_xi+1,windxi0    ;smpy(x[i],wind[i])
        SMPYH .M  windi_windi+1,xi_xi+1,windxi0+1  ;smpy(x[i+1],wind[i+1])
        SADD  .L  windxi0,0x8000L,windxi1          ;sadd(smpy(x[i],wind[i]),0x8000L)
        SADD  .L  windxi0+1,0x8000L,windxi1+1      ;sadd(smpy(x[i+1],wind[i+1]),0x8000L)
        SHR   .S  windxi1,16,yi                    ;sadd(smpy(x[i],wind[i]),0x8000L)>>16
        SHR   .S  windxi1+1,16,yi+1                ;sadd(smpy(x[i+1],wind[i+1]),0x8000L)>>16
        SMPYH .M  windxi1,windxi1,yyi              ;smpy(y[i],y[i])
        SMPYH .M  windxi1+1,windxi1+1,yyi+1        ;smpy(y[i+1],y[i+1])
        SADD  .L  sum_e,yyi,sum_e                  ;sum_e=sadd(sum_e,smpy(y[i],y[i]))
        SADD  .L  sum_o,yyi+1,sum_o                ;sum_o=sadd(sum_o,smpy(y[i+1],y[i+1]))
        STH   .D  yi,*yptre++[2]                   ;store y[i]
        STH   .D  yi+1,*yptro++[2]                 ;store y[i+1]
[cntr]  SUB   .S  cntr,2,cntr                      ;decrement loop counter
[cntr]  B     .S  LOOPI                            ;branch to loop


The only difference between these two implementations is how yyi and yyi+1
are computed. Using yyi as an example, the former approach computes yyi
following the order of the original C code:

    yyi = _smpy(_sadd(_smpy(a,b),0x8000L)>>16,
                _sadd(_smpy(a,b),0x8000L)>>16),

The latter computes yyi in a slightly different way as:

    yyi = _smpyh(_sadd(_smpy(a,b),0x8000L),
                 _sadd(_smpy(a,b),0x8000L)).

This provides the flexibility to better pack the instructions and reduces cycle
count.
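
The two expressions for yyi produce the same value, because multiplying the
upper halfwords of the rounded product is the same as shifting right by 16
and then multiplying. The following sketch checks this in portable C. It is
illustrative only: sat_add32, smpy16, and smpyh32 are hypothetical models of
_sadd, _smpy, and _smpyh, and the code assumes that >> on a negative value is
an arithmetic shift.

```c
#include <stdint.h>

/* Hypothetical model of _sadd: 32-bit saturating add. */
static int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* Hypothetical model of _smpy on two halfwords: (a*b)<<1, saturated. */
static int32_t smpy16(int16_t a, int16_t b)
{
    int64_t p = 2 * (int64_t)a * (int64_t)b;
    return (p > INT32_MAX) ? INT32_MAX : (int32_t)p;
}

/* Hypothetical model of _smpyh: SMPY applied to the upper halfwords. */
static int32_t smpyh32(int32_t a, int32_t b)
{
    return smpy16((int16_t)(a >> 16), (int16_t)(b >> 16));
}

/* Former approach: round to 16 bits first, then square. */
static int32_t yyi_former(int16_t x, int16_t w)
{
    int16_t yi = (int16_t)(sat_add32(smpy16(x, w), 0x8000) >> 16);
    return smpy16(yi, yi);
}

/* Latter approach: square the upper halfword of the rounded product. */
static int32_t yyi_latter(int16_t x, int16_t w)
{
    int32_t rounded = sat_add32(smpy16(x, w), 0x8000);
    return smpyh32(rounded, rounded);
}
```

In the latter form the multiply reads the rounded product directly, so it does
not have to wait for the result of the SHR instruction.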


Loop I is a two-cycle loop. Loop II is still a single-cycle loop. Its
instructions are shown in Example A-14.


Example A-14. Linear Assembly for Loop II of autocorr.c (Modified)

LOOPII:
        LDH   .D  *yptrl++,yi     ;load y[i]
        SHR   .S  yi,2,yi0        ;shr(y[i],2)
        SMPY  .M  yi0,yi0,yyi     ;smpy(shr(y[i],2),shr(y[i],2))
        SADD  .L  sum,yyi,sum     ;sum=sadd(sum,smpy(shr(y[i],2),shr(y[i],2)))
        STH   .D  yi0,*yptrs++    ;store y[i]=shr(y[i],2)
[cntr]  SUB   .L  cntr,1,cntr     ;decrement loop counter
[cntr]  B     .S  LOOPII          ;branch to loop


A.2.2.3 Memory Bank Hits

To schedule loop I as a 2-cycle loop:

- x[i] + x[i+1] << 16 and wind[i] + wind[i+1] << 16 must be loaded in the
  same cycle.

- y[i] and y[i+1] must be stored in the other cycle.

To avoid a memory bank hit:

- Allocate x and wind in different memory spaces, if possible. For instance,
  allocate wind in data ROM and x in data RAM.

- If no data ROM is available, allocate x and wind so they are offset from
  each other by one word.

There is no memory bank problem when storing y[i] and y[i+1].


No memory bank hits occur in loop II, because the distance between the load
and store is always six halfwords.
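
The rules above can be checked with a small bank model. The sketch below is
not taken from the manual: it assumes that data memory is interleaved across
four banks, each one halfword (16 bits) wide, which is consistent with the
statements in this section but should be confirmed against the data sheet for
your device. The names NUM_BANKS, banks_touched, and bank_hit are
hypothetical.

```c
#include <stdint.h>

/* Assumption: four interleaved banks, each one halfword (16 bits) wide. */
enum { NUM_BANKS = 4 };

/* Bit mask of the banks touched by an access of `bytes` bytes at `addr`. */
static unsigned banks_touched(uint32_t addr, unsigned bytes)
{
    unsigned mask = 0;
    for (uint32_t hw = addr / 2; hw < (addr + bytes + 1) / 2; hw++)
        mask |= 1u << (hw % NUM_BANKS);
    return mask;
}

/* Nonzero when two accesses issued in the same cycle share a bank. */
static int bank_hit(uint32_t addr_a, unsigned bytes_a,
                    uint32_t addr_b, unsigned bytes_b)
{
    return (banks_touched(addr_a, bytes_a) &
            banks_touched(addr_b, bytes_b)) != 0;
}
```

Under this model, two word loads from identically aligned arrays collide,
while offsetting one array by a word moves its access to the other two banks.
Two halfword accesses that are six halfwords apart, as in loop II, never share
a bank.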
The modified C code of this part of autocorr.c is shown in Example A-15.

Example A-15. Implemented C Code for autocorr.c

Word16 i;
Word16 y[L_WINDOW];
Word32 sum, sum_e, sum_o;
Word16 overfl, overfl_shft;

/* Windowing of signal */

sum_e = sum_o = 0L;

for (i = 0; i < L_WINDOW; i += 2)
{
    y[i]   = mult_r(x[i], wind[i]);
    y[i+1] = mult_r(x[i+1], wind[i+1]);
    sum_e  = L_mac(sum_e, y[i], y[i]);
    sum_o  = L_mac(sum_o, y[i+1], y[i+1]);
}
sum = sum_e + sum_o;

/* Compute r[0] and test for overflow */

overfl_shft = 0;

do
{
    overfl = 0;
    /* If overflow divide y[] by 4 */
    if (sum == MAX_32)
    {
        overfl_shft = add(overfl_shft, 4);
        overfl = 1;                /* Set the overflow flag */
        sum = 0L;
        for (i = 0; i < L_WINDOW; i++)
        {
            y[i] = shr(y[i], 2);
            sum = L_mac(sum, y[i], y[i]);
        }
    }
}
while (overfl != 0);


A.2.2.4 Code-Size Reduction 


Finally, consider the code-size reduction. In Figure A-3 on page A-13, loop I is
always executed and loop II is executed only for high input levels. This means
that cycle count is the most important factor for loop I, while code size is more
critical for loop II.


A.2.2.5 Final Assembly Code 
The final assembly code is presented in Example A-16.


Example A-16. Assembly Code for Windowing and Scaling Part of autocorr.c

Implementation of the Windowing and Scaling Part of autocorr.c in EFR
Compute two samples at a time
Total cycles = 257 (No Scaling)
             = 519 (One Scaling)

Register Usage:

; B4  -- &x[0]
; A4  -- &window[0]
; A6  -- &y[0]
; B8  -- L_WINDOW
; A0  -- sum and sum_e
; B0  -- sum_o
; B15 -- stack pointer

; notice that we use the latter approach in Example A-13


SHL 
MVK 
ADD 
MV 


LDW 
LDW 
MVKLH 
MV 


-D2 
-D1 
Sl 
-S2 


.L1X 
-S2 


D2 
-D1 
Sl 
-L1X 


*B4++,B5 
*A4++,A5 
480,A6 
B8,6,Bl 


B15,A6,A6 
Bd 


A6o,2,B6 
Ao,A3 


*B44++,B5 
*A44++,A5 
32767, A10 
B7,A7 


load x[0] & x[1] 

load wind[0] & wind[1] 
reserve space for y[i] 
LOOP I counter 


&y [0] 
load x[2] & x[3] 
load wind[2] & wind[3] 


32768 or O0x8000L for rounding 


load x[4] & x[5] 

load wind[4] & wind[5] 
7£fffFLff = MAX_32 
32768 


Example A-16. Assembly Code for Windowing and Scaling Part of autocorr.c (Continued)

        SMPYH .M2X  B5,A5,B2      ; smpy(x[1],wind[1])
        SMPY  .M1X  B5,A5,A2      ; smpy(x[0],wind[0])
        B     .S2   LOOPI
        LDW   .D2   *B4++,B5      ; load x[6] & x[7]
        LDW   .D1   *A4++,A5      ; load wind[6] & wind[7]
        MVK   .S1   0,A0          ; sum_e = 0
        MVK   .S2   0,B0          ; sum_o = 0
        SMPYH .M2X  B5,A5,B2      ; smpy(x[3],wind[3])
        SMPY  .M1X  B5,A5,A2      ; smpy(x[2],wind[2])
        SADD  .L1   A2,A7,A2      ; sadd(smpy(x[0],wind[0]),0x8000L)
        SADD  .L2   B2,B7,B2      ; sadd(smpy(x[1],wind[1]),0x8000L)
        B     .S1   LOOPI
        LDW   .D2   *B4++,B5      ; load x[8] & x[9]
        LDW   .D1   *A4++,A5      ; load wind[8] & wind[9]
        SHR   .S1   A2,16,A9      ; y[0]=sadd(smpy(x[0],wind[0]),0x8000L)>>16
        SHR   .S2   B2,16,B9      ; y[1]=sadd(smpy(x[1],wind[1]),0x8000L)>>16
        SMPYH .M1   A2,A2,A11     ; smpy(y[0],y[0])
        SMPYH .M2   B2,B2,B11     ; smpy(y[1],y[1])
LOOPI:
        STH   .D1   A9,*A6++[2]   ; store y[0]
        STH   .D2   B9,*B6++[2]   ; store y[1]
        SADD  .L1   A2,A7,A2      ; sadd(smpy(x[2],wind[2]),0x8000L)
        SADD  .L2   B2,B7,B2      ; sadd(smpy(x[3],wind[3]),0x8000L)
        SMPYH .M2X  B5,A5,B2      ; smpy(x[5],wind[5])
        SMPY  .M1X  B5,A5,A2      ; smpy(x[4],wind[4])
  [B1]  SUB   .S2   B1,2,B1       ; decrement the loop counter
  [B1]  B     .S1   LOOPI
        SADD  .L1   A0,A11,A0     ; sum_e += smpy(y[0],y[0])
        SADD  .L2   B0,B11,B0     ; sum_o += smpy(y[1],y[1])
        SMPYH .M1   A2,A2,A11     ; smpy(y[2],y[2])
        SMPYH .M2   B2,B2,B11     ; smpy(y[3],y[3])
        SHR   .S1   A2,16,A9      ; y[2]=sadd(smpy(x[2],wind[2]),0x8000L)>>16
        SHR   .S2   B2,16,B9      ; y[3]=sadd(smpy(x[3],wind[3]),0x8000L)>>16
        LDW   .D2   *B4++,B5      ; load x[10] & x[11]
        LDW   .D1   *A4++,A5      ; load wind[10] & wind[11]
        SADD  .L1X  A0,B0,A0      ; sum = sum_e + sum_o
        MPY   .M2   B0,0,B0       ; overfl_shift = 0
                                  ; LOOP I completed
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Example A—16. Assembly Code for Windowing and Scaling Part of autocorr.c (Continued) 


LTEST: 


Al 
Al 


Al 
Al 
Al 


LOOPITI: 


[Bl] 


[Bl] 


FINISH: 


AO,A10,Al1 


FINISH 
*A3,B5 
A3,2,B9 
BO, 4,B0 


*BO++,. BO 
B8,7,Bl 
LOOP II 
A3,A9 
*B9++,B5 
0,A0 
LOOPII 


*BO++ BS 
LOOPII 
AO0,A2 


*B9++,B5 
LOOPIT 


*B9++,B5 
B5,2,A5 
LOOPIT 


*B9++,B5 
B5,2,A5 

LOOPII 

A5, *A9++ 
B1,-1,B1 
A5,A5,A2 
A2,A0,A0 


A5, *A9++ 
A5,A5,A2 
A2,A0,A0 
LTEST 


A2,A0,A0 


A2,A0,A0 


if (sum == MAX_32) 


No, exit 

load y[0] 

é&y[1] 

add (overfl_shift, 4) 


load y[1] 
counter for LOOPII 


load y[3] 
to take care of the initial condition 


load y[4] 


load y[5] 
y[0]=shr (y[0],2) 


load y[6] 

yl] = shr(y[1],2) 
branch 

store y[0] 

decrement LOOPII counter 
smpy (y[0],y[0]) 

sum +=smpy(y[il,y[il]) 


store y[n-1] 

smpy (y[n-1],y[n-1]) 

sum +=smpy (y[n-3],y[n-3]) 
branch back to LTEST 


sum +=smpy (y[n-2],y[n-2]) 
sum +=smpy (y[n-1],y[n-1]) 


save the code siz 
If code size is not an issue, you can eliminate the last three NOPs by
expanding the epilog of loop II. This saves three cycles every time loop II
executes; however, code size increases by two fetch packets (2 x 32 = 64 bytes).


A.2.3 Implementation of cor_h 


The cor_h routine is the second most computationally intensive routine. It is
called to compute the matrix of autocorrelation, rr. The core part of cor_h is
presented in Example A-17.


Example A-17. C Code for cor_h

#define L_CODE 40

input:
    Word16 sign[L_CODE], h[L_CODE];

output:
    Word16 rr[L_CODE][L_CODE];

local variables/arrays:
    Word16 h2[L_CODE];   /* function of h, the impulse response of weighted
                            synthesis filter */
    Word16 dec, j, i, k;
    Word32 s;

Original C code

for (dec=1; dec<L_CODE; dec++)
{
    s = 0;
    j = L_CODE-1;
    i = sub(j, dec);
    for (k=0; k<(L_CODE-dec); k++, i--, j--)
    {
        s = L_mac(s, h2[k], h2[k+dec]);
        rr[j][i] = mult(round(s), mult(sign[i],sign[j]));
        rr[i][j] = rr[j][i];
    }
}

where sub(a,b)     = _ssub(a<<16, b<<16)>>16
      L_mac(a,b,c) = _sadd(a,_smpy(b,c))
      mult(a,b)    = _smpy(a,b)>>16
and   round(a)     = _sadd(a,0x8000L)>>16


The instructions to execute one iteration of the inner loop are listed in 
Example A-18. 
Example A-18. Linear Assembly for cor_h (One Inner Loop Iteration)

INNERLOOP:
        LDH   .D   *h2ptr++,h2k          ;load h2[k]
        LDH   .D   *h2decptr++,h2deck    ;load h2[k+dec]
        SMPY  .M   h2k,h2deck,h2kk       ;smpy(h2[k],h2[k+dec])
        SADD  .L   s,h2kk,s              ;sadd(s,smpy(h2[k],h2[k+dec]))
        SADD  .L   s,0x8000L,sround      ;round(s)<<16
        LDH   .D   *signiptr--,signi     ;load sign[i]
        LDH   .D   *signjptr--,signj     ;load sign[j]
        SMPY  .M   signi,signj,signij    ;smpy(sign[i],sign[j])=mult(sign[i],sign[j])<<16
        SMPYH .M   signij,sround,rrji0   ;L_mult(round(s),mult(sign[i],sign[j]))
        SHR   .S   rrji0,16,rrji         ;rr[j][i]
        STH   .D   rrji,*rrjiptr--[41]   ;store rr[j][i]
        STH   .D   rrji,*rrijptr--[41]   ;store rr[i][j]
[icntr] SUB   .ALU icntr,1,icntr         ;decrement inner loop counter
[icntr] B     .S   INNERLOOP             ;branch to inner loop


In Example A-18, h2ptr and h2decptr are the pointers for h2, pointing to h2[k] 
and h2[k+dec]. The pointers for sign, signiptr and signjptr, point to sign[i] and 
sign[j]. The pointers for rr, rrjiptr and rrijptr, point to rr[j][i] and rr[i][j], respec- 
tively. 


Notice that each element rr[j][i] is implemented as:
rr[j][i] = (_smpyh(_sadd(s, 0x8000L), _smpy(sign[i], sign[j]))) >> 16
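
The fused expression above gives the same result as the original
mult(round(s), mult(sign[i],sign[j])), because both right shifts by 16 are
absorbed by a multiply of the upper halfwords. The sketch below compares the
two forms. It is illustrative only: sat_add32, smpy16, smpyh32, mult16, and
round32 are hypothetical C models of the intrinsics and basic operators, and
the code assumes that >> on a negative value is an arithmetic shift.

```c
#include <stdint.h>

/* Hypothetical model of _sadd: 32-bit saturating add. */
static int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* Hypothetical model of _smpy on two halfwords: (a*b)<<1, saturated. */
static int32_t smpy16(int16_t a, int16_t b)
{
    int64_t p = 2 * (int64_t)a * (int64_t)b;
    return (p > INT32_MAX) ? INT32_MAX : (int32_t)p;
}

/* Hypothetical model of _smpyh: SMPY applied to the upper halfwords. */
static int32_t smpyh32(int32_t a, int32_t b)
{
    return smpy16((int16_t)(a >> 16), (int16_t)(b >> 16));
}

/* mult(a,b) = _smpy(a,b)>>16 and round(a) = _sadd(a,0x8000L)>>16 */
static int16_t mult16(int16_t a, int16_t b)
{
    return (int16_t)(smpy16(a, b) >> 16);
}
static int16_t round32(int32_t a)
{
    return (int16_t)(sat_add32(a, 0x8000) >> 16);
}

/* rr[j][i] as written in the original C code. */
static int16_t rr_original(int32_t s, int16_t si, int16_t sj)
{
    return mult16(round32(s), mult16(si, sj));
}

/* rr[j][i] as implemented in the linear assembly. */
static int16_t rr_fused(int32_t s, int16_t si, int16_t sj)
{
    return (int16_t)(smpyh32(sat_add32(s, 0x8000), smpy16(si, sj)) >> 16);
}
```

The fused form needs one SHR per element instead of three, which is why only
one shift appears in the inner loop.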


The .D unit is used most often (six times in the inner loop). Ideally, these
instructions can be arranged in three cycles. However, memory bank hits
occur with any combination of the load and/or store instructions.
Next, consider unrolling the inner loop once. The C code is shown in
Example A-19.


Example A-19. C Code for cor_h (With Inner Loop Unrolling)

for (dec=1; dec<L_CODE; dec++)
{
    s = 0;
    j = L_CODE-1;
    i = sub(j, dec);

    for (k=0; k<(L_CODE-dec); k+=2, i-=2, j-=2)
    {
        s = L_mac(s, h2[k], h2[k+dec]);
        rr[j][i] = mult(round(s), mult(sign[i],sign[j]));
        rr[i][j] = rr[j][i];
        s = L_mac(s, h2[k+1], h2[k+1+dec]);
        rr[j-1][i-1] = mult(round(s), mult(sign[i-1],sign[j-1]));
        rr[i-1][j-1] = rr[j-1][i-1];
    }

    if ((dec&1)!=0) {
        s = L_mac(s,h2[L_CODE-dec-1],h2[L_CODE-1]);
        rr[dec][0] = mult(round(s),mult(sign[0],sign[dec]));
        rr[0][dec] = rr[dec][0];
    }
}

Eight values must be loaded and four values must be stored in every iteration;
however, h2[k] and h2[k+1] can be loaded in a word. The same is true for
sign[j] and sign[j-1]. A total of six loads are required. The inner loop
instructions are shown in Example A-20.

Example A-20. Linear Assembly for cor_h (With Inner Loop Unrolling)

INNERLOOP:
        LDW    .D   *h2ptr++,h2k_h2k+1             ;load h2[k] and h2[k+1]
        LDH    .D   *h2decptr++,h2deck             ;load h2[k+dec]
        SMPY   .M   h2k_h2k+1,h2deck,h2kk0         ;smpy(h2[k],h2[k+dec])
        SADD   .L   s,h2kk0,s                      ;sadd(s,smpy(h2[k],h2[k+dec]))
        SADD   .L   s,0x8000L,sround               ;round(s)<<16
        LDH    .D   *signiptr--,signi              ;load sign[i]
        LDW    .D   *signjptr--,signj_signj-1      ;load sign[j] and sign[j-1]
        SMPYLH .M   signi,signj_signj-1,signij0    ;smpy(sign[i],sign[j])
        SMPYH  .M   signij0,sround,rrji0           ;L_mult(round(s),mult(sign[i],sign[j]))
        SHR    .S   rrji0,16,rrji                  ;rr[j][i]
        STH    .D   rrji,*rrjiptr--[82]            ;store rr[j][i]
        STH    .D   rrji,*rrijptr--[82]            ;store rr[i][j]
        LDH    .D   *h2decptr++,h2deck+1           ;load h2[k+1+dec]
        SMPYHL .M   h2k_h2k+1,h2deck+1,h2kk1       ;smpy(h2[k+1],h2[k+1+dec])
        SADD   .L   s,h2kk1,s                      ;sadd(s,smpy(h2[k+1],h2[k+1+dec]))
        SADD   .L   s,0x8000L,sround               ;round(s)<<16
        LDH    .D   *signiptr--,signi-1            ;load sign[i-1]
        SMPY   .M   signi-1,signj_signj-1,signij1  ;smpy(sign[i-1],sign[j-1])
        SMPYH  .M   signij1,sround,rrji1           ;L_mult(round(s),mult(sign[i-1],sign[j-1]))
        SHR    .S   rrji1,16,rrj1i1                ;rr[j-1][i-1]
        STH    .D   rrj1i1,*rrj1i1ptr--[82]        ;store rr[j-1][i-1]
        STH    .D   rrj1i1,*rri1j1ptr--[82]        ;store rr[i-1][j-1]
[icntr] SUB    .ALU icntr,2,icntr                  ;decrement inner loop counter
[icntr] B      .S   INNERLOOP                      ;branch to loop


To avoid memory bank hits:

- Load words (h2[k], h2[k+1]) and (sign[i-1], sign[i]) together and
  allocate h2 and sign so that they are aligned with each other.

- Store rr[j][i] and rr[j-1][i-1] together and rr[i][j] and rr[i-1][j-1]
  together.

There are five load/store pairs, so each iteration requires only five cycles. You
gain speed both by eliminating the memory bank hits and by reducing
the cycles required to complete each rr.


The final assembly code with reduced code size is shown in Example A-21.
Here, the primitive technique introduced in section 6.4.3.4, Priming the Loop,
on page 6-47 is used to reduce the code size for both the prolog and epilog
of the inner loop.
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Example A-21. Assembly Code for cor_h With Reduced Code Size 



Texas Instruments, Inc 
Implementation of cor_h in EFR 
Compute four rrs at a time 
Total cycles = 2533 


Register Usage: 


16 


SUB LL1 A4,1,A13 ; 
|| ADDK .S1 76,A6 i 
|| ADDK 82 3360,B6 i 
|| SUB .D1 A4,1,A2 ; 
MVK 82 0,B2 i 
|| ADD .L2 B4,2,B13 ; 
1 | MVK .S1 2,All ; 
, 
OUTERLOOP: 
LDW .D1 *A6,A10 i 
LDW D2 *B4,B12 ; 
ADD .L1X  B13,2,A3 i 
SUB “st A6,A11,A4 ; 
[A2] ADD L2X  A2,2,B0 ; 
MPY .M1 A13,A11,A3 
MPY .M2 B1l,0,Bi1l ; 
LDH D2 *B13++[2],A7 ; 
LDH D1 *A3,B7 ; 
ADD .L2X  A4,2,B9 ; 
MV .$2 B6,B14 ; 
SUB ld A6,4,A8 ; 
[B2] ADDK ica -164,A14 ; 


B 


15 


A4 --- L_CODE 
B4 --- &h2[0] 

A6 --- &sign[0] 
B6 --- &rr[0] [0] 


used to obtain érr[i][j] 
&Ssign[L_CODE-2] 

&xrxr [L_CODE-1] [L_CODE-2]+[82]=é&rr[j] [i]+[82] 
outer loop counter 


and é&rr[i-1] [5-1] 


not doing the initial store 

&h2 [k+dec] 

used to increase/decrease the pointers 
for h2 and sign 


load sign[j-1] & sign[j] 

load h2[k] & h2[k+1] 
&h2[k+dect+1] 

&sign[i-1] 

define the inner loop counter 


initialize s 


load h2[ 
load h2[ 
&sign[i] 
érr[j] [i] +[82] 
é&sign[j-3] 

from &rr[dec] [0]+[82] 


k+dec] 
k+dec+1] 


to &rr[dec] [0] 


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Example A—21. Assembly Code for cor_h With Reduced Code Size (Continued) 


B2] STH 
[B2] 


BO] B 
[A2] 
[A2] 
[B2] 


[BO] MV 


[NNERLOOP : 


[!B1l] B 


[Al] 


LDH 
LDH 
SMP YHL 
SMP YLH 
ADDK 
ADDK 


-D2 
DL 
~L2 
re ral 
-S2 
Sl 


-D1 
sol 
-L1X 
on 
.L2X 
D2 


52 
Sl 
.L2X 
D1 
set 
.D2 


52 
M1 
.L2X 


-D1 
sD2 
Sl 
-L1X 


D1 
D2 
-M2 
-M1X 
S52 
LL 
»L2X 


D2 
-D1 
M1 
-M2 
sl 
02 


*B9,A0 
*R4--[2],B5 
B4,4,B8 
All,2,A11 
-82,B14 
3,A1 


Al2,*A14 
-164,A9 
B6,A14 
BO,1,B0 
B14, A3,B3 
B6,2,B6 


INNERLOOP 
A2,1,A2 
A2,1,B2 
A12,*A9 
A14,A3,A9 
BO,B1 


B9,16,B10 
A3,A0,A3 
B11,A15,B9 


*A8—-, A10 
*B8++, B12 
OUTERLOOP 
B13,2,A3 


*A3,B7 
*B13++[2],A7 
B9,B5,B9 
A7,B12,A7 
B1,1,B1 
Al,1,Al 
A4,2,B9 


*B9,A0 
*A4--[2],B5 
A10,A0,A0 
B7,B12,B7 
-164,A14 
-164,B14 


load sign[i] 

load sign[i-1] 

&h2 [k+2] 

update All 

&rr[j] [i] 

determine when the stores in the inner loop 
actually starts 


store rr[dec] [0] 

from &rr[0] [dec]+[82] 
érr[j] [iJ+[82] 

inner loop counter 
grr [i-1] [4-1] 
srr[j][i-1], 
outer loop iteration 


tp é&rr[0] [dec] 


for the next 


decrement outer loop counter 

decide if the last store is needed 
store rr[0] [dec] 

érr[i] [j]+[82] 

counter for branching to outer loop 


## obtain rr[j-1] [i-1] 
# smpyh(sadd(s,0x8000L),smpy(sign[i],sign[j])) 
# sadd(s,0x8000L) 


*load sign[j] & sign[j-1] 
*load h2[k] & h2[k+1] 
outer LOOP 


&h2 [k+dect1] 


*h2 [kt+dect1] 

*h2 [kt+dec] 

# smpyh (sadd(s, 0x8000L) 
smpy (h2[k],h2 [k+dec] 
decrement the counter for branching to the outer loop 


,smpy (sign[i-1],sign[j-1])) 


decrement the inner loop 
&sign[i] 


*load sign[i] 

*load sign[i-1] 

Reet ys eee 

smpy (h2[k+1],h2[k+l+dec] ) 

## from ceri) i ij+ ae to &érr[j] [i] 

## from &rr[j-1] [i-1]+[82] to &rr[j-1] [i-1] 
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Example A-21. Assembly Code for cor_h With Reduced Code Size (Continued) 


'A1] 
'Al 


BO] 


'A2] 


FINISH: 


STH 
STH 


NOP 


-D1 
-D2 
-L1 
M2 
~L2 
proie 
-S2 


Al2,*A14 ; ## store rr[j] [li 
B10, *B14 ; ## store rr[j-1][i-] 

xX B11,A7,A5 7 S = sadd(s,smpy (h2[k],h2[k+dec] ) 

x Al10,B5,B5 ; smpy(sign[i-1],sign[j-1] 
BO,1,B0 ; decrement inner loop counter 
-164,A9 ; ## from &rr[i][j]+[82] to rrfi] [jl 
-164,B3 ; ## from &rr[i-1][j-1]+[82] to &rr[i-1] [j-1] 
Al2,*A9 ; ## store rrf[i][j 
B10, *B3 ; ## store rr[j-1][i-1 
A3,16,A12 ; # obtain rr[j-1] [i-1 
A5,A15,A3 ; sadd(s,0x8000L) 

x A5,B7,B11 7 Ss = sadd(s,smpy (h2[k+1],h2[k+dect+1] 
INNERLOOP ; end of INNERLOOP 

x B4,A11,B13 ; &h2[k+dec] 
FINISH 7; exit 
The value of s is represented by both B11 and A5 to avoid two .L1 or two .L2
units occurring in the same execute packet. Due to the dependence on s, as
well as the removal of memory bank hits, it takes 20 cycles for each iteration
of the modified C code. The pound sign (#) in the comments indicates that,
each time the outer loop enters the inner loop, this instruction is not executed
(or that the result of this instruction is not useful) until the number of iterations
denoted by # has occurred.


The code size is 11 fetch packets (352 bytes). Without applying the primitive
technique, the code size will be at least four fetch packets more than the code
shown in Example A-21.


You can squeeze the instruction

        ADD   .L2X  B4,A11,B13    ; &h2[k+dec]

into the inner loop to save about 1.5% of the cycle counts, with an increase in
program memory of one fetch packet.




A.2.4 Implementation of the rrv Computation in search_10i40 


Example A-22 shows the implementation of the rrv computation in search_10i40. 


Example A-22. C Code for the rrv Computation in search_10i40

#define L_CODE 40
#define STEP   5
#define _1_16  (Word16)(32768L/16)
#define _1_8   (Word16)(32768L/8)

input:
    Word16 rr[L_CODE][L_CODE], ipos[L_CODE];

local variables/arrays:
    Word16 rrv[L_CODE];
    Word16 i0,i1,i2,i3,i4,i5,i6,i7,i8,i9;  /* defined on [0,L_CODE-1] */
    Word32 s;

(The values of i0, i1, i2, i3, i4, i5, i6, and i7 were obtained before entering this loop.)

Original C code

for (i9 = ipos[9]; i9 < L_CODE; i9 += STEP)
{
    s = L_mult(rr[i9][i9], _1_16);
    s = L_mac(s, rr[i0][i9], _1_8);
    s = L_mac(s, rr[i1][i9], _1_8);
    s = L_mac(s, rr[i2][i9], _1_8);
    s = L_mac(s, rr[i3][i9], _1_8);
    s = L_mac(s, rr[i4][i9], _1_8);
    s = L_mac(s, rr[i5][i9], _1_8);
    s = L_mac(s, rr[i6][i9], _1_8);
    s = L_mac(s, rr[i7][i9], _1_8);
    rrv[i9] = round(s);
}

where L_mult(a,b)  = _smpy(a,b)
      L_mac(a,b,c) = _sadd(a,_smpy(b,c))
and   round(a)     = _sadd(a,0x8000L)>>16
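
The computation of Example A-22 can be written in portable C by substituting
the definitions above. The following is a minimal sketch rather than the EFR
source: sat_add32 and smpy16 are hypothetical models of _sadd and _smpy,
ONE_16 and ONE_8 stand in for _1_16 and _1_8, and the array idx (an assumed
name) holds i0 through i7. The code assumes that >> on a negative value is an
arithmetic shift.

```c
#include <stdint.h>

enum { L_CODE = 40, STEP = 5 };
#define ONE_16 ((int16_t)(32768L / 16))   /* _1_16 in Example A-22 */
#define ONE_8  ((int16_t)(32768L / 8))    /* _1_8 in Example A-22  */

/* Hypothetical model of _sadd: 32-bit saturating add. */
static int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* Hypothetical model of _smpy on two halfwords: (a*b)<<1, saturated. */
static int32_t smpy16(int16_t a, int16_t b)
{
    int64_t p = 2 * (int64_t)a * (int64_t)b;
    return (p > INT32_MAX) ? INT32_MAX : (int32_t)p;
}

/* rrv for one value of i9; idx[0..7] hold i0..i7. */
static int16_t rrv_one(int16_t rr[L_CODE][L_CODE], const int16_t idx[8], int i9)
{
    int32_t s = smpy16(rr[i9][i9], ONE_16);               /* L_mult */
    for (int m = 0; m < 8; m++)
        s = sat_add32(s, smpy16(rr[idx[m]][i9], ONE_8));  /* L_mac  */
    return (int16_t)(sat_add32(s, 0x8000) >> 16);         /* round  */
}

/* The loop of Example A-22: one rrv value every STEP columns. */
static void rrv_all(int16_t rr[L_CODE][L_CODE], const int16_t idx[8],
                    int ipos9, int16_t rrv[L_CODE])
{
    for (int i9 = ipos9; i9 < L_CODE; i9 += STEP)
        rrv[i9] = rrv_one(rr, idx, i9);
}
```

Each call to rrv_one performs nine loads from rr and one store, which is the
source of the ten .D-unit operations per iteration counted in this section.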


The instructions for one loop iteration are shown in Example A-23.

Example A-23. Linear Assembly for the rrv Computation in search_10i40
              (One Loop Iteration)

LOOP:
        LDH   .D   *rr9ptr++[205],rr99   ;load rr[i9][i9]
        SMPY  .M   rr99,_1_16,s          ;s=L_mult(rr[i9][i9],_1_16)
        LDH   .D   *rr0ptr++[5],rr09     ;load rr[i0][i9]
        SMPY  .M   rr09,_1_8,s0          ;L_mult(rr[i0][i9],_1_8)
        SADD  .L   s,s0,s                ;s=L_mac(s,rr[i0][i9],_1_8)
        LDH   .D   *rr1ptr++[5],rr19     ;load rr[i1][i9]
        SMPY  .M   rr19,_1_8,s1          ;L_mult(rr[i1][i9],_1_8)
        SADD  .L   s,s1,s                ;s=L_mac(s,rr[i1][i9],_1_8)
        LDH   .D   *rr2ptr++[5],rr29     ;load rr[i2][i9]
        SMPY  .M   rr29,_1_8,s2          ;L_mult(rr[i2][i9],_1_8)
        SADD  .L   s,s2,s                ;s=L_mac(s,rr[i2][i9],_1_8)
        LDH   .D   *rr3ptr++[5],rr39     ;load rr[i3][i9]
        SMPY  .M   rr39,_1_8,s3          ;L_mult(rr[i3][i9],_1_8)
        SADD  .L   s,s3,s                ;s=L_mac(s,rr[i3][i9],_1_8)
        LDH   .D   *rr4ptr++[5],rr49     ;load rr[i4][i9]
        SMPY  .M   rr49,_1_8,s4          ;L_mult(rr[i4][i9],_1_8)
        SADD  .L   s,s4,s                ;s=L_mac(s,rr[i4][i9],_1_8)
        LDH   .D   *rr5ptr++[5],rr59     ;load rr[i5][i9]
        SMPY  .M   rr59,_1_8,s5          ;L_mult(rr[i5][i9],_1_8)
        SADD  .L   s,s5,s                ;s=L_mac(s,rr[i5][i9],_1_8)
        LDH   .D   *rr6ptr++[5],rr69     ;load rr[i6][i9]
        SMPY  .M   rr69,_1_8,s6          ;L_mult(rr[i6][i9],_1_8)
        SADD  .L   s,s6,s                ;s=L_mac(s,rr[i6][i9],_1_8)
        LDH   .D   *rr7ptr++[5],rr79     ;load rr[i7][i9]
        SMPY  .M   rr79,_1_8,s7          ;L_mult(rr[i7][i9],_1_8)
        SADD  .L   s,s7,s                ;s=L_mac(s,rr[i7][i9],_1_8)
        SADD  .L   s,0x8000L,sround      ;round(s)
        SHR   .S   sround,16,rrv9        ;rrv[i9]
        STH   .D   rrv9,*rrv9ptr++[5]    ;store rrv[i9]
[icntr] SUB   .ALU icntr,1,icntr         ;decrement inner loop counter
[icntr] B     .S   LOOP                  ;branch to loop
The following table shows the pointers in Example A-23 and the arrays they
point to.

Pointer     For array
rr9ptr      rr[i9][i9]
rr0ptr      rr[i0][i9]
rr1ptr      rr[i1][i9]
rr2ptr      rr[i2][i9]
rr3ptr      rr[i3][i9]
rr4ptr      rr[i4][i9]
rr5ptr      rr[i5][i9]
rr6ptr      rr[i6][i9]
rr7ptr      rr[i7][i9]
rrv9ptr     rrv[i9]


The .D unit is used the most (ten times per iteration). Although these
instructions can be arranged in five cycles, any combination of the loads hits
the same memory bank, because any two values loaded are a multiple of 40
halfwords (one row of rr) apart. It still takes ten cycles for one rrv.
Next, consider unrolling the inner loop once. The C code is shown in
Example A-24.


Example A-24. C Code for the rrv Computation in search_10i40 (Unrolled Loop)

for (i9 = ipos[9]; i9 < L_CODE; i9 += 2*STEP)
{
    s = L_mult(rr[i9][i9], _1_16);
    S = L_mult(rr[i9+5][i9+5], _1_16);
    s = L_mac(s, rr[i0][i9], _1_8);
    S = L_mac(S, rr[i0][i9+5], _1_8);
    s = L_mac(s, rr[i1][i9], _1_8);
    S = L_mac(S, rr[i1][i9+5], _1_8);
    s = L_mac(s, rr[i2][i9], _1_8);
    S = L_mac(S, rr[i2][i9+5], _1_8);
    s = L_mac(s, rr[i3][i9], _1_8);
    S = L_mac(S, rr[i3][i9+5], _1_8);
    s = L_mac(s, rr[i4][i9], _1_8);
    S = L_mac(S, rr[i4][i9+5], _1_8);
    s = L_mac(s, rr[i5][i9], _1_8);
    S = L_mac(S, rr[i5][i9+5], _1_8);
    s = L_mac(s, rr[i6][i9], _1_8);
    S = L_mac(S, rr[i6][i9+5], _1_8);
    s = L_mac(s, rr[i7][i9], _1_8);
    S = L_mac(S, rr[i7][i9+5], _1_8);
    rrv[i9]   = round(s);
    rrv[i9+5] = round(S);
}

Example A-25 shows the instructions for each iteration.

Example A-25. Linear Assembly for rrv Computation in search_10i40 (One Loop Iteration)

LOOP:
        LDH   .D   *rr9ptr++[410],rr99     ;load rr[i9][i9]
        SMPY  .M   rr99,_1_16,s            ;s=L_mult(rr[i9][i9],_1_16)
        LDH   .D   *rr95ptr++[410],rr995   ;load rr[i9+5][i9+5]
        SMPY  .M   rr995,_1_16,S           ;S=L_mult(rr[i9+5][i9+5],_1_16)
        LDH   .D   *rr0ptr++[10],rr09      ;load rr[i0][i9]
        SMPY  .M   rr09,_1_8,s0            ;L_mult(rr[i0][i9],_1_8)
        SADD  .L   s,s0,s                  ;s=L_mac(s,rr[i0][i9],_1_8)
        LDH   .D   *rr05ptr++[10],rr095    ;load rr[i0][i9+5]
        SMPY  .M   rr095,_1_8,S0           ;L_mult(rr[i0][i9+5],_1_8)
        SADD  .L   S,S0,S                  ;S=L_mac(S,rr[i0][i9+5],_1_8)
        LDH   .D   *rr1ptr++[10],rr19      ;load rr[i1][i9]
        SMPY  .M   rr19,_1_8,s1            ;L_mult(rr[i1][i9],_1_8)
        SADD  .L   s,s1,s                  ;s=L_mac(s,rr[i1][i9],_1_8)
        LDH   .D   *rr15ptr++[10],rr195    ;load rr[i1][i9+5]
        SMPY  .M   rr195,_1_8,S1           ;L_mult(rr[i1][i9+5],_1_8)
        SADD  .L   S,S1,S                  ;S=L_mac(S,rr[i1][i9+5],_1_8)
        LDH   .D   *rr2ptr++[10],rr29      ;load rr[i2][i9]
        SMPY  .M   rr29,_1_8,s2            ;L_mult(rr[i2][i9],_1_8)
        SADD  .L   s,s2,s                  ;s=L_mac(s,rr[i2][i9],_1_8)
        LDH   .D   *rr25ptr++[10],rr295    ;load rr[i2][i9+5]
        SMPY  .M   rr295,_1_8,S2           ;L_mult(rr[i2][i9+5],_1_8)
        SADD  .L   S,S2,S                  ;S=L_mac(S,rr[i2][i9+5],_1_8)
        LDH   .D   *rr3ptr++[10],rr39      ;load rr[i3][i9]
        SMPY  .M   rr39,_1_8,s3            ;L_mult(rr[i3][i9],_1_8)
        SADD  .L   s,s3,s                  ;s=L_mac(s,rr[i3][i9],_1_8)
        LDH   .D   *rr35ptr++[10],rr395    ;load rr[i3][i9+5]
        SMPY  .M   rr395,_1_8,S3           ;L_mult(rr[i3][i9+5],_1_8)
        SADD  .L   S,S3,S                  ;S=L_mac(S,rr[i3][i9+5],_1_8)
        LDH   .D   *rr4ptr++[10],rr49      ;load rr[i4][i9]
        SMPY  .M   rr49,_1_8,s4            ;L_mult(rr[i4][i9],_1_8)
        SADD  .L   s,s4,s                  ;s=L_mac(s,rr[i4][i9],_1_8)
        LDH   .D   *rr45ptr++[10],rr495    ;load rr[i4][i9+5]
        SMPY  .M   rr495,_1_8,S4           ;L_mult(rr[i4][i9+5],_1_8)
        SADD  .L   S,S4,S                  ;S=L_mac(S,rr[i4][i9+5],_1_8)
        LDH   .D   *rr5ptr++[10],rr59      ;load rr[i5][i9]
        SMPY  .M   rr59,_1_8,s5            ;L_mult(rr[i5][i9],_1_8)
        SADD  .L   s,s5,s                  ;s=L_mac(s,rr[i5][i9],_1_8)
        LDH   .D   *rr55ptr++[10],rr595    ;load rr[i5][i9+5]
        SMPY  .M   rr595,_1_8,S5           ;L_mult(rr[i5][i9+5],_1_8)
        SADD  .L   S,S5,S                  ;S=L_mac(S,rr[i5][i9+5],_1_8)
        LDH   .D   *rr6ptr++[10],rr69      ;load rr[i6][i9]
        SMPY  .M   rr69,_1_8,s6            ;L_mult(rr[i6][i9],_1_8)
        SADD  .L   s,s6,s                  ;s=L_mac(s,rr[i6][i9],_1_8)
        LDH   .D   *rr65ptr++[10],rr695    ;load rr[i6][i9+5]
        SMPY  .M   rr695,_1_8,S6           ;L_mult(rr[i6][i9+5],_1_8)
        SADD  .L   S,S6,S                  ;S=L_mac(S,rr[i6][i9+5],_1_8)
        LDH   .D   *rr7ptr++[10],rr79      ;load rr[i7][i9]
        SMPY  .M   rr79,_1_8,s7            ;L_mult(rr[i7][i9],_1_8)
        SADD  .L   s,s7,s                  ;s=L_mac(s,rr[i7][i9],_1_8)
        LDH   .D   *rr75ptr++[10],rr795    ;load rr[i7][i9+5]
        SMPY  .M   rr795,_1_8,S7           ;L_mult(rr[i7][i9+5],_1_8)
        SADD  .L   S,S7,S                  ;S=L_mac(S,rr[i7][i9+5],_1_8)
        SADD  .L   s,0x8000L,sround        ;round(s)
        SHR   .S   sround,16,rrv9          ;rrv[i9]
        STH   .D   rrv9,*rrv9ptr++[10]     ;store rrv[i9]
        SADD  .L   S,0x8000L,Sround        ;round(S)
        SHR   .S   Sround,16,rrv95         ;rrv[i9+5]
        STH   .D   rrv95,*rrv95ptr++[10]   ;store rrv[i9+5]
[icntr] SUB   .ALU icntr,2,icntr           ;decrement inner loop counter
[icntr] B     .S   LOOP                    ;branch to loop
The following table shows the pointers in Example A-25 and the arrays they
point to.

Pointer                  For array
rr9ptr and rr95ptr       rr[i9][i9] and rr[i9+5][i9+5]
rrxptr and rrx5ptr       rr[ix][i9] and rr[ix][i9+5] (where x = 0, 1, ..., 7)
rrv9ptr and rrv95ptr     rrv[i9] and rrv[i9+5]


Again, the .D unit is used the most (twenty times per iteration).


None of the pairs of rr[ix][i9], rr[iy][i9+5] hit the same memory bank (where
ix, iy = i0, i1, ..., i7). The same is true for pairs rrv[i9], rrv[i9+5], as well as
for rr[i9][i9] and rr[i9+5][i9+5]. For ease of understanding:

- Load rr[ix][i9], rr[ix][i9+5] together.
- Load rr[i9][i9], rr[i9+5][i9+5] together.
- Store rrv[i9], rrv[i9+5] together.

In this way, each iteration takes ten cycles without any memory bank hits. You
double the speed by unrolling the loop once.
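
The bank behavior of the rr accesses follows from the row length of 40
halfwords. The sketch below counts colliding pairs for a given pair of
columns. It is illustrative only and assumes four interleaved banks that are
each one halfword wide, which is consistent with the statements in this
section but is not specified here; NUM_BANKS, rr_bank, and count_hits are
hypothetical names.

```c
#include <stdint.h>

enum { L_CODE = 40, NUM_BANKS = 4 };   /* NUM_BANKS is an assumption */

/* Bank of rr[row][col], counted in halfwords from the start of rr. */
static unsigned rr_bank(unsigned row, unsigned col)
{
    return (row * L_CODE + col) % NUM_BANKS;
}

/* Number of (ix, iy) row pairs for which rr[ix][col_a] and rr[iy][col_b]
   fall in the same bank. */
static unsigned count_hits(unsigned col_a, unsigned col_b)
{
    unsigned hits = 0;
    for (unsigned ix = 0; ix < L_CODE; ix++)
        for (unsigned iy = 0; iy < L_CODE; iy++)
            if (rr_bank(ix, col_a) == rr_bank(iy, col_b))
                hits++;
    return hits;
}
```

Because 40 is a multiple of the assumed bank count, every pair of loads from
the same column collides, while loads from columns i9 and i9+5 never do. This
is why pairing each access with its i9+5 counterpart removes the bank hits.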


The final assembly code is shown in Example A-26. 
Example A-26. Assembly Code for the rrv Computation in search_10i40

Texas Instruments, Inc
Implementation of the rrv Computation in search_10i40 in EFR
Compute two rrvs at a time
Total cycles = 55

Register Usage:


MVK 
MVK 


MVK 


MP YU 
SHL 
SUB 
ADD 


MP YU 
ADD 
MVK 


MP YU 
ADD 
ADD 
ADD 
ADD 
MP YU 


sl 
-S2 


-S2 


M2 
-S1X 
~L2 
-S2X 


.M2 
-L2X 
Sl 


M2 
-L1X 
~L2 
sl 
-S2X 
-M1 


A 


16 


B 


14 


; B4 --- i0 

‘BS == ail 

; B6 --- 12 

; BY -===.23 

; A& --- 14 

; BQ ==> a5 

7 ALQ == 426 

«All == 17 

; BS == 19 

; A15 --- &rr[0] [0] 

; AO --- &rrv[0] 

; B14 --- stack pointer 
410,A2 offset of rr[i9][i9] 
410,B2 offset of rr[i9+5] [i9+5] 
82,BO0 
B3,B0,B3 [i9] [19] 
B3,1,A13 
BO,2,B0 80 
A15,B2,B13 érr[5] [5] 
B4,B0,B4 [i0] [0 
A15,10,B15 éxrr [0] [5] 
80,Al 
B5,B0,B5 [il] [0 
B3,A15,A3 érr[i9] [i9] 
B3,B13,B3 &érr[i9+5] [i9+5] 
A15,A13,A15 érr[0] [i9] 
B15,A13,B15 érr[0] [i9+5] 
A10,A1,A10 [i6] [0 


Kk 


k* 


K* 


k* 


Kk 


k* 


K* 


K* 


K* 


K* 


K* 


K* 
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Example A-26. Assembly Code for the rrv Computation in search_10i40 (Continued) 


MP YU .M2 B6,B0,B6 

MP YU .M1 Al11,A1,A11 
ADD -L1X B4,A15,A4 

ADD .L2 B4,B15,B4 

LDH -D1 *A3++[A2],A13 
LDH .D2 *B3++[B2],B13 
ADD Si AO,A13,A0 

MP YU .M2 B7,B0,B7 

MP YU .M1 A8,A1,A8 

ADD .L1X B5,A15,A5 

ADD .L2 B5,B15,B5 

ADD pope A10,A15,A10 
ADD ~S2X A10,B15,B10 
LDH -DL *A4++[10],A13 
LDH »D2 *B4++[10],B13 
MP YU .M2 B9,BO,B9 

ADD .L1X B6,A15,A6 

ADD .L2 B6,B15,B6 

ADD oa! Al11,A15,A11 
ADD ~S2X Al11,B15,B11 
LDH -D1 *A5++[10],A13 
LDH D2 *B5++[10],B13 
ADD .L1X B7,A15,A7 

ADD .L2 B7,B15,B7 

ADD .S1 A8,A15,A8 

ADD ~S2X A8,B15,B8 

LDH .D1 *A6++[10],A13 
LDH «D2 *B6++[10],B13 
ADD .L1X B9,A15,A9 

ADD .L2 B9,B15,B9 

LDH ae 0 *A7++[10],A13 
LDH .D2 *B7,B13 

MVK 752 2048,B7 

LDH DL *A8++([10],A13 
LDH D2 *B8++[10],B13 
SMPY .M1X A13,B7,A12 
SMPY M2 B13,B7,B12 
SHL S2 Bi, i,BI 

ADD .L2X AO,10,B0 


’ 


[i2] [0] 

[i7] [0] 
&rr[i0] [i9] 

&xrr ee ee 
load rr[i9] [19] 
load ee ee 
&rrv[i9] 

[i3] [0] 

[14] [0] 

&rr[il] [i9] 
&rr[il] [i19+5] 
&rr[i6] [9] 
&rr[i6] [i9+5] 


load rr[i0] [i9] 
load rr[i0] [i9+5] 


[i9] [0] 
&érr[i2] [i9 
&érr[i2] [i9+5] 
&rxr[i7] [i9 
&rr[i7] [i9+5] 


load rr[il] [i9] 
load rr[il] [i9+5] 


&rr[i3] [i9 
&érr[i3] [i9+5] 
&érr[i4] [i9 
&rr[i4] [i9+5] 


load rr[i2][i9] 
load rr[i2] [i9+5] 


&rr[i5][i9 
&rr[i5] [i9+5] 
load rr[i3][i9] 
load rr[i3] [i9+5] 
_1_16 


load rr[i4] [19] 
load rr[i4] [i9+5] 
s=smpy (rr[i9] [i9],_1_16) 


S=smpy (rr[i9+5] [i9+5],_1_ 


1_8 


&rrv[i9t5] 


16) 
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LOOP: 


nner Ee 


Aur 


had CS 


hae OD 


-D1 
-D2 


-M1X 


*A9++[10],A13 
*B9++[10],B13 
A13,B7,A15 
B13,B7,B15 


*A10++[10],A13 
*B10++[10],B13 
A13,B7,A15 
Bl3;,,B/,BL5 


*A11++[10],A13 
*B11++[10],B13 
A13,B7,A15 
B13,B7,B15 
A12,A15,A12 
B12,B15,B12 
3,Al 


A13,B7,A15 
B13,B7,B15 
A12,A15,A12 
B12,B15,B12 
32767,A14 


Al12,A15,A12 
B12,B15,B12 
A13,B7,A15 
B13,B7,B15 
*A3++[A2],A13 
*B3++[B2],B13 
A14,1,A14 


A13,B7,A15 
B13,B7,B15 
A12,A15,A12 
B12,B15,B12 
*A4++[10],A13 
*B4,B13 


A13,B7,A15 
Bi3,,B/,BLS 
A12,A15,A12 
B12,B15,B12 
*A5++[10],A13 
*B5++[10],B13 


, 


, 


, 


, 
, 
, 
, 
, 


r 


;* load rr[i9+5 


;* load rrf[i 
c* load reli 


load rr[i5] [i9 
load rr[i5] [i9+5] 
s0=smpy (rr[i0] 
SO=smpy (rr[i0] 
load rr[i6] [i9 
load rr[i6] [i9+5] 
sl=smpy (rr[il] 
Sl=smpy (rr[il] 


load rr [i9 


[i7 
load rr[i7] [i9+5] 
(rr 


s2=smpy i2] 
S2=smpy (rr[i2] 
s=sadd(s,s0) 
S=sadd(S,S0) 
loop counter 


19] 7218) 
£9+5)]\,21_8) 


i9],_1_8) 
i19+5],_1_8) 


i9],_1_8) 
19+5)]);—1_8) 


s3=smpy (rr[i3] [i9],_1_8) 
S3=smpy (rr[i3] [i9+5],_1_8) 


s=sadd(s,s1) 
S=sadd(S,S1) 


s=sadd(s,s2) 
S=sadd(S,S2) 
s4=smpy (rr[i4 
S4=smpy (rr[i4 


;* load rr[i9] [i9] 


i9],_1_8) 
19+5],_1_8) 


19+5] 


32768 for rounding 


s5=smpy (rr[i5] [i9],_1_8) 
S5=smpy (rr[i5] [i9+5],_1_8) 
s=sadd(s,s3) 

S=sadd(S,S3) 

* load rr[id] [19] 

* load rr[id] [19+5] 


s6=smpy (rr[i6] [i9],_1_8) 
S6=smpy (rr[i6] [i9+5],_1_8) 


s=sadd(s,s4) 
S=sadd(S,S4 


1 
1 


) 
][i9 
] [19+5] 


] 
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Example A-26. Assembly Code for the rrv Computation in search_10i40 (Continued) 


SMPY .M1X A13,B7,A15 
SMPY .M2 B13,B7,B15 
SADD .~L1 A12,A15,A12 
SADD L2 B12,B15,B12 
LDH DL *A6++[10],A13 
LDH D2 *B6++[10],B13 
ADD S2x A7,10,B7 
SADD L A12,A15,A12 
SADD L2 B12,B15,B12 
LDH D *A7++[10],A13 
LDH D2 *B7,B13 
MVK S2 2048,B7 

[Al] B Ss LOOP 
SADD -L1 A12,A15,A12 
SADD .L2 B12,B15,B12 
LDH D *A8++[10],A13 
LDH D2 *B8++[10],B13 
SMPY .M1X A13,B7,A12 
SMPY .M2 B13,B7,B12 
SHL iS2 B7,1,B7 

[Al] SUB Sil Al,1,Al1 
SADD .-L1 Al12,A14,A14 
SADD .L2X B12,A14,B4 
LDH POD EIE *A9++[10],A13 
LDH .D2 *B9++[10],B13 
SMPY .M1X A13,B7,A15 
SMPY .M2 B13,B7,B15 
SHR .ol Al14,16,A14 
SHR ~52 B4,16,B4 
SMPY .M1X A13,B7,A15 
SMPY .M2 B13,B7,B15 
LDH et *A10++[10],A13 
LDH D2 *B10++[10],B13 
SMPY .M1X A13,B7,A15 
SMPY .M2 B13,B7,B15 
SADD -L1 A12,A15,A12 
SADD .L2 B12,B15,B12 
LDH Pa kse *A11++[10],A13 
LDH ~D2 *B11++[10],B13 


, 


, 


, 
’ 


v 


ir 
, 


, 


s7=smpy (rr[i7] [i9],_1_8) 
S7=smpy (rr[i7] [i9+5],_1_8) 
s=sadd(s,s5) 

S=sadd(S,S5) 


7* Load rx 


* load rrl[ 


12] [19] 
12] [19+5] 


&rxr[i3] [19+5] 


s=sadd(s,s6) 
S=sadd(S,S6) 


* load rr[i3] [i9] 


3] 
;* load rr[i3] [i9+5] 


_1_16 
branch to the loop 


s=sadd(s,s7) 
S=sadd(S,8S7) 


* load rr[i 
* load. rr[a 


* s=smpy (rr 


4} [19] 
4} [19+5] 
[19] [i9],_1_16) 


* S=smpy (rr[i9+5] [i9+5],_1_16) 
_1_8 
decrement loop counter 


round(s 
round(S 


rrv[i9] 
ery (2935) 
* sl=smpy (rr[il] [i9],_1_8) 
* Sl=smpy (rr[il] [i9+5],_1_8) 
* load rr[i6] [i9 

* load rr[i6] [i9+5] 


* s2=smpy (rr[i2] [i9],_1_8) 

* S2=smpy (rr[i2] [i9+5],_1_8) 
* s=sadd(s,s0) 

* S=sadd(S,S0) 

* load re[i7v] [29 

* load rr[i7] [i9+5] 
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Example A—26. Assembly Code for the rrv Computation in search_10i40 (Continued) 


STH 
hy STH 
I] SMPY 
a SMPY 
im SADD 
im SADD 
['{ ADD 
| | MVK 


-D1 
-D2 
-M1X 
M2 
-L1 
~L2 
~S2X 
Sl 


A14, *A0++[10] 
B4, *B0++[10] 


store rrv[i9] 
store rrv[i9t+5] 


A13,B7,A15 ;* s3=smpy (rr[i3] [i9],_1_8) 
B13,B7,B15 7* S3=smpy (rr[i3] [i9+5],_1_8) 
A12,A15,A12 ;* s=sadd(s,sl) 

B12,B15,B12 ;* S=sadd(S,S1) 

A4,10,B4 a* &xre [20] [2945] 

32767,A14 ; end of LOOP 


Because of the shortage of registers: 


- B7 serves as _1_16, as _1_8, and as the pointer to rr[i3][i9+5].
- B4 holds both the value of rrv[i9+5] and the pointer to rr[i0][i9+5].
- A14 holds 0x8000L as well as rrv[i9].


The last iteration of the loop can be expanded as the epilog of the loop to over- 
lap with the prolog of the code that follows this part of the code. 
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A.2.5 Implementation of the Index Search in search_10i40 


The index search in search_10i40 is the core of search_10i40. The C code is 
shown in Example A-27. 


Example A-—27. C Code for the Index Search for search_10i40 


#define L_CODE 40
#define STEP 5
#define _1_16 (Word16) (32768L/16)
#define _1_8 (Word16) (32768L/8)

input:
Word16 rr[L_CODE][L_CODE], ipos[L_CODE], dn[L_CODE];

local variables/arrays:
Word16 rrv[L_CODE];
Word16 i0,i1,i2,i3,i4,i5,i6,i7,i8,i9; /* defined on [0,L_CODE-1] */
Word16 ia,ib;
Word16 ps,ps0,ps1,ps2,sq,sq2;
Word16 alp,alp_16;
Word32 s,alp0,alp1,alp2;

(The values of i0, i1, i2, i3, i4, i5, i6, i7, ps0, and alp0 have
been obtained before entering this loop.)

Original C code

sq = -1;
alp = 1;
ps = 0;
ia = ipos[8];
ib = ipos[9];

/* initialize 10 indices for i8 loop (see i2-i3 loop) */
for (i8 = ipos[8]; i8 < L_CODE; i8 += STEP)
{
    ps1 = add (ps0, dn[i8]);

    alp1 = L_mac (alp0, rr[i8][i8], _1_128);
    alp1 = L_mac (alp1, rr[i0][i8], _1_64);
    alp1 = L_mac (alp1, rr[i1][i8], _1_64);
    alp1 = L_mac (alp1, rr[i2][i8], _1_64);
    alp1 = L_mac (alp1, rr[i3][i8], _1_64);
    alp1 = L_mac (alp1, rr[i4][i8], _1_64);
    alp1 = L_mac (alp1, rr[i5][i8], _1_64);
    alp1 = L_mac (alp1, rr[i6][i8], _1_64);
    alp1 = L_mac (alp1, rr[i7][i8], _1_64);

    /* initialize 3 indices for i9 inner loop (see i2-i3 loop) */
    for (i9 = ipos[9]; i9 < L_CODE; i9 += STEP)
    {
        ps2 = add (ps1, dn[i9]);
        alp2 = L_mac (alp1, rrv[i9], _1_8);
        alp2 = L_mac (alp2, rr[i8][i9], _1_64);
        sq2 = mult (ps2, ps2);
        alp_16 = round (alp2);
        s = L_msu (L_mult (alp, sq2), sq, alp_16);
        if (s > 0) {
            sq = sq2;
            ps = ps2;
            alp = alp_16;
            ia = i8;
            ib = i9;
        }
    }
}

where add(a,b)      = _sadd(a<<16,b<<16)>>16
      L_mac(a,b,c)  = _sadd(a,_smpy(b,c))
      mult(a,b)     = _smpyh(a<<16,b<<16)>>16
      L_mult(a,b)   = _smpy(a,b)
      round(a)      = _sadd(a,0x8000L)>>16
and   L_msu(a,b,c)  = _ssub(a,_smpy(b,c))



This is a typical example of the performance being limited by data dependency 
constraints. In this case, the dependency is between the values of alp and sq. 
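
To make the mapping between the basic operations and the intrinsics concrete, the following sketch emulates them in portable C. It is not TI code: the emu_ and op_ names are hypothetical, and the emulation assumes the usual intrinsic behavior (_sadd and _ssub saturate to 32 bits, and _smpy multiplies the signed lower 16 bits of its operands, shifts the product left by one, and saturates).

```c
#include <stdint.h>

/* Clamp a 64-bit value to the signed 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Signed value of the lower 16 bits of a. */
static int32_t low16(int32_t a)
{
    int32_t lo = a & 0xFFFF;
    return (lo >= 0x8000) ? lo - 0x10000 : lo;
}

/* Upper 16 bits of a, sign extended (an arithmetic a >> 16). */
static int32_t high16(int32_t a)
{
    return (int32_t)(((int64_t)a - (a & 0xFFFF)) / 65536);
}

/* Stand-ins for the _sadd, _ssub, and _smpy intrinsics. */
static int32_t emu_sadd(int32_t a, int32_t b) { return sat32((int64_t)a + b); }
static int32_t emu_ssub(int32_t a, int32_t b) { return sat32((int64_t)a - b); }
static int32_t emu_smpy(int32_t a, int32_t b)
{
    return sat32(2 * (int64_t)low16(a) * low16(b));
}

/* Basic operations written in terms of the emulated intrinsics. */
static int16_t op_add(int16_t a, int16_t b)
{
    return (int16_t)high16(emu_sadd(a * 65536, b * 65536));
}
static int16_t op_mult(int16_t a, int16_t b)
{
    return (int16_t)high16(emu_smpy(a, b));
}
static int32_t op_L_mult(int16_t a, int16_t b) { return emu_smpy(a, b); }
static int32_t op_L_mac(int32_t acc, int16_t a, int16_t b)
{
    return emu_sadd(acc, emu_smpy(a, b));
}
static int32_t op_L_msu(int32_t acc, int16_t a, int16_t b)
{
    return emu_ssub(acc, emu_smpy(a, b));
}
static int16_t op_round(int32_t a)
{
    return (int16_t)high16(emu_sadd(a, 0x8000));
}
```

For example, op_mult(16384, 16384) yields 8192 (0.5 times 0.5 in Q15), and op_add(32767, 1) saturates at 32767.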
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A.2.5.1 Rearranging the C Code 


To avoid the unnecessary shift, ps, ps1, ps2, alp, alp_16, sq, and sq2 are
implemented as int (Word32) variables. The calculations are implemented as:


Original                          Implemented as

ps1 = add (ps0, dn[i8]);          ps1 = _sadd(ps0, dn[i8]<<16);
ps2 = add (ps1, dn[i9]);          ps2 = _sadd(ps1, dn[i9]<<16);
sq2 = mult (ps2, ps2);            sq2 = _smpyh(ps2,ps2);
alp_16 = round(alp2);             alp_16 = _sadd(alp2,0x8000L);


There is no need to compute s explicitly. Instead of implementing the following
sequence:

s = L_msu (L_mult (alp, sq2), sq, alp_16);
if (s > 0)
{
    sq = sq2;
    ps = ps2;
    alp = alp_16;
    ia = i8;
    ib = i9;
}


you can use the following sequence, which performs the same task:


if (_smpyh(alp,sq2) > _smpyh(sq,alp_16)) {
    sq = sq2;
    ps = ps2;
    alp = alp_16;
    ia = i8;
    ib = i9;
}

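
The direct comparison works because a saturating subtraction never changes the sign of the true difference, so s > 0 holds exactly when the first product is larger than the second. The sketch below checks this on a grid of sample values. It uses portable stand-ins for the intrinsics (hypothetical emu_ names) and 16-bit inputs, which correspond to the upper halfwords that _smpyh reads in the rearranged code.

```c
#include <stdint.h>

/* Clamp a 64-bit value to the signed 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Saturating fractional multiply of two 16-bit values (stand-in). */
static int32_t emu_smpy(int16_t a, int16_t b)
{
    return sat32(2 * (int64_t)a * b);
}

/* Saturating 32-bit subtract (stand-in). */
static int32_t emu_ssub(int32_t a, int32_t b)
{
    return sat32((int64_t)a - b);
}

/* Original test: compute s explicitly, then test its sign. */
static int test_with_s(int16_t alp, int16_t sq2, int16_t sq, int16_t alp_16)
{
    int32_t s = emu_ssub(emu_smpy(alp, sq2), emu_smpy(sq, alp_16));
    return s > 0;
}

/* Rearranged test: compare the two products directly. */
static int test_by_compare(int16_t alp, int16_t sq2, int16_t sq, int16_t alp_16)
{
    return emu_smpy(alp, sq2) > emu_smpy(sq, alp_16);
}

/* Counts sample inputs on which the two tests disagree. */
static int count_mismatches(void)
{
    static const int16_t v[] = { -32768, -32767, -12345, -1, 0,
                                 1, 2, 4096, 12345, 32767 };
    const int n = (int)(sizeof v / sizeof v[0]);
    int mismatches = 0;
    for (int a = 0; a < n; a++)
        for (int b = 0; b < n; b++)
            for (int c = 0; c < n; c++)
                for (int d = 0; d < n; d++)
                    if (test_with_s(v[a], v[b], v[c], v[d]) !=
                        test_by_compare(v[a], v[b], v[c], v[d]))
                        mismatches++;
    return mismatches;
}
```

Dropping the subtraction removes one instruction from the dependency chain between sq2, alp_16, and the conditional updates.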
A.2.5.2 Performance Analysis 


The instructions to execute one iteration of the inner loop are shown in 
Example A-28. 
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Linear Assembly for the Index Search for search_10i40 (Inner Loop) 


INNERLOOP:
        LDH     .D      *dn9ptr++[5],dn9        ; load dn[i9]
        SHL     .S      dn9,16,dn9h             ; dn[i9] << 16
        SADD    .L      ps1,dn9h,ps2            ; ps2 = sadd(ps1, dn[i9] << 16)
        SMPYH   .M      ps2,ps2,sq2             ; sq2 = smpyh(ps2,ps2)
        LDH     .D      *rrvptr++[5],rrv        ; load rrv[i9]
        SMPY    .M      rrv,_1_8,tmp1           ; smpy(rrv[i9], _1_8)
        SADD    .L      alp1,tmp1,alp2          ; alp2=sadd(alp1,smpy(rrv[i9],_1_8))
        LDH     .D      *rr89ptr++[5],rr89      ; load rr[i8][i9]
        SMPY    .M      rr89,_1_64,tmp2         ; smpy(rr[i8][i9],_1_64)
        SADD    .L      alp2,tmp2,alp2          ; alp2=sadd(alp2,smpy(rr[i8][i9],_1_64))
        SADD    .L      alp2,0x8000L,alp_16     ; alp_16=sadd(alp2,0x8000L)
        SMPYH   .M      alp,sq2,tmp3            ; smpyh(alp,sq2)
        SMPYH   .M      sq,alp_16,tmp4          ; smpyh(sq,alp_16)
        CMPGT   .L      tmp3,tmp4,cndr          ; if(smpyh(alp,sq2) > smpyh(sq,alp_16))
[cndr]  MV      .ALU    sq2,sq
[cndr]  MV      .ALU    ps2,ps
[cndr]  MV      .ALU    alp_16,alp
[cndr]  MV      .ALU    i8,ia
[cndr]  MV      .ALU    i9,ib
[icntr] SUB     .ALU    icntr,1,icntr
[icntr] B       .S      INNERLOOP               ; branch to the loop


Because both sq and alp are carried over from one iteration to the next,
keep their values in registers so that they can be retrieved quickly.
Computing the new sq and alp values takes at least four cycles, and the
requirement on the functional units does not exceed four execution packets.
Therefore, the inner loop can execute in four cycles per iteration.


For the outer loop, any pair of rr[ix][i8], rr[iy][i8] (where ix, iy = i0, i1, ..., i7)
hits the same memory bank if the two values are read together. Therefore,
load them one per cycle.
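
The four-cycle figure for the inner loop can be reproduced with a simple bound: the loop cannot run faster than the larger of its resource bound and its loop-carried dependency length. The sketch below is an illustration under stated assumptions, not a scheduler. It assumes two units of each type (.L, .S, .M, .D), at most eight instructions per cycle, and that MV and SUB can be placed on any .L, .S, or .D unit. The OpCount type and function names are hypothetical.

```c
/* Integer ceiling of n / d for positive d. */
static int ceil_div(int n, int d) { return (n + d - 1) / d; }

static int max_int(int a, int b) { return a > b ? a : b; }

/* Instruction counts for one loop iteration, grouped by the units
 * that can execute them (hypothetical structure for this sketch). */
typedef struct {
    int d_only;   /* loads and stores: .D units */
    int m_only;   /* multiplies: .M units */
    int l_only;   /* SADD, CMPGT: .L units */
    int s_only;   /* SHL, B: .S units */
    int any_alu;  /* MV, SUB: assumed to fit any .L, .S, or .D unit */
} OpCount;

/* Lower bound on cycles per iteration imposed by the functional units. */
static int resource_bound(OpCount c)
{
    int total = c.d_only + c.m_only + c.l_only + c.s_only + c.any_alu;
    int bound = ceil_div(total, 8);
    bound = max_int(bound, ceil_div(c.d_only, 2));
    bound = max_int(bound, ceil_div(c.m_only, 2));
    bound = max_int(bound, ceil_div(c.l_only, 2));
    bound = max_int(bound, ceil_div(c.s_only, 2));
    /* Everything except the multiplies shares the six .L/.S/.D units. */
    bound = max_int(bound, ceil_div(total - c.m_only, 6));
    return bound;
}

/* Cycles per iteration: limited by units or by the recurrence. */
static int min_cycles_per_iteration(OpCount c, int recurrence_cycles)
{
    return max_int(resource_bound(c), recurrence_cycles);
}
```

For the inner loop listed above (3 loads, 5 multiplies, 5 .L operations, 2 .S operations, and 6 MV or SUB instructions), the resource bound is three cycles, so the four-cycle dependency between successive sq and alp values is what sets the iteration time.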


A.2.5.3 Partitioning the Registers 


The total number of registers required for this code, including the registers for 
the pointer of the arrays, loop counters, intermediate results, etc., exceeds the 
number of registers available. To partition the registers without losing speed, 
the strategies are: 


- For the inner loop, store the results of ps, ia, and ib on the stack,
  because their values are not used in this code.


- For the outer loop, store the pointers to the arrays starting at rr[i5][i8],
  rr[i6][i8], and rr[i7][i8], because their values are needed last in the
  outer loop.




Assume that before entering this code, the following values are known: &dn[0],
&ipos[0], &rr[0][0], &rrv[0], i0, i1, i2, i3, i4, i5, i6, i7, ps0, and alp0. Assume
that the short (Word16) integers are stored in the stack in the order i0, i1, i2,
i3, i4, i5, i6, i7, ia, and ib, and that a pointer &local_16[0], pointing to i0, is also
known. The int integers and the pointers of the rr arrays are stored in the stack
in the following order: ps0, ps, alp0, alp1, &rr[i5][i8], &rr[i6][i8], and &rr[i7][i8].
The pointer &local_32[0], pointing to ps0, is known as well.
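
The stack layout described above can be written down as two sets of named indices, which makes the local_16[...] and local_32[...] accesses in the modified code easier to follow. This is only a sketch: the enumerator names and the init_search_state helper are hypothetical, and the pointer slots are shown as 32-bit words, as they would be on the 'C6x.

```c
#include <stdint.h>

typedef int16_t Word16;
typedef int32_t Word32;

/* Slots of the 16-bit area addressed through &local_16[0]. */
enum Local16Slot {
    L16_I0, L16_I1, L16_I2, L16_I3, L16_I4, L16_I5, L16_I6, L16_I7,
    L16_IA,          /* best i8 found so far */
    L16_IB,          /* best i9 found so far */
    L16_COUNT
};

/* Slots of the 32-bit area addressed through &local_32[0]. */
enum Local32Slot {
    L32_PS0,
    L32_PS,
    L32_ALP0,
    L32_ALP1,
    L32_PTR_RR_I5_I8,   /* saved pointer &rr[i5][i8] */
    L32_PTR_RR_I6_I8,   /* saved pointer &rr[i6][i8] */
    L32_PTR_RR_I7_I8,   /* saved pointer &rr[i7][i8] */
    L32_COUNT
};

/* Initializes ps, ia, and ib the way the search code does before the
 * i8 loop (hypothetical helper). */
static void init_search_state(Word16 local_16[L16_COUNT],
                              Word32 local_32[L32_COUNT],
                              const Word16 ipos[])
{
    local_32[L32_PS] = 0;
    local_16[L16_IA] = ipos[8];
    local_16[L16_IB] = ipos[9];
}
```

With these names, local_16[8] and local_16[9] are ia and ib, local_32[1] is ps, and local_32[3] is alp1.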


The C code is shown in Example A-29. 


Example A-29. Modified C Code for the Index Search 


sq = -1;
alp = 1;
local_32[1] = 0;
local_16[8] = ipos[8];
local_16[9] = ipos[9];

/* initialize 10 indices for i8 loop (see i2-i3 loop) */

for (i8 = ipos[8]; i8 < L_CODE; i8 += STEP) {

    ps1 = _sadd(local_32[0], dn[i8]<<16);

    local_32[3] = _sadd(local_32[2], _smpy(rr[i8][i8], _1_128));
    local_32[3] = _sadd(local_32[3], _smpy(rr[i0][i8], _1_64));
    local_32[3] = _sadd(local_32[3], _smpy(rr[i1][i8], _1_64));
    local_32[3] = _sadd(local_32[3], _smpy(rr[i2][i8], _1_64));
    local_32[3] = _sadd(local_32[3], _smpy(rr[i3][i8], _1_64));
    local_32[3] = _sadd(local_32[3], _smpy(rr[i4][i8], _1_64));
    local_32[3] = _sadd(local_32[3], _smpy(rr[i5][i8], _1_64));
    local_32[3] = _sadd(local_32[3], _smpy(rr[i6][i8], _1_64));
    local_32[3] = _sadd(local_32[3], _smpy(rr[i7][i8], _1_64));

    /* initialize 3 indices for i9 inner loop (see i2-i3 loop) */

    for (i9 = ipos[9]; i9 < L_CODE; i9 += STEP) {

        ps2 = _sadd(ps1, dn[i9]<<16);
        alp2 = _sadd(local_32[3], _smpy(rrv[i9], _1_8));
        alp2 = _sadd(alp2, _smpy(rr[i8][i9], _1_64));

        sq2 = _smpyh(ps2, ps2);

        alp_16 = _sadd(alp2,0x8000L);

        if (_smpyh(alp,sq2) > _smpyh(sq,alp_16)) {
            sq = sq2;
            local_32[1] = ps2;
            alp = alp_16;
            local_16[8] = i8;
            local_16[9] = i9;
        }
    }
}


A.2.5.4 Final Assembly Code 


The final code consists of the following steps: 


Step 1: Load i0, i1, ... i9, alp0, and ps0, and initialize sq, ia, and ib. Part
        of the code overlaps the last iteration of the code in section
        A.2.4 on page A-27.


Step 2: Obtain the pointers to the arrays starting at rr[i0][i8], rr[i1][i8],
        ... rr[i7][i8], rr[i8][i9], rrv[i9], dn[i8], and dn[i9].


Step 3: Load rr[i0][i8], rr[i1][i8], ... rr[i7][i8] and dn[i8], compute
        the new ps1 and alp1, update the pointers, and store pointers
        &rr[i5][i8], &rr[i6][i8], and &rr[i7][i8].


Step 4: Load rr[i8][i9], rrv[i9], and dn[i9]. Compute alp2, ps2, alp_16,
        and sq2, and perform a comparison. Update the parameters ia, ib, alp,
        sq, and ps based on the comparison result. Repeat this step eight
        times.


Step 5: Reload the values of ps0 and alp0, and &rr[i5][i8], &rr[i6][i8],
        and &rr[i7][i8]. Verify that step 3 has been repeated eight times.
        If not, go to step 3. If so, exit.


To avoid memory bank hits, arrays rr and rrv must not be aligned on the same
word or half-word boundary. The same applies to arrays rr and dn. As you can
see in the final assembly code shown in Example A-30, there are several
places where LDH (or STH) and LDW (or STW) occur in the same execution
packet. These cases fall into two categories. In the first category, the
instructions always load values from or store values to the same memory
locations, as in the inner loop:


        LDW     .D1     *+A6[3],A11     ; load alp1
|| [B2] STH     .D2     B13,*+B6[9]     ; store ib=i9


In the second category, an instruction from the inner loop is paired with an
access whose memory location changes from one iteration to the next, as in
the outer loop:


   [B2] STW     .D1     B11,*+A6[1]     ; store ps
||      LDH     .D2     *B10++[5],A5    ; load rr[i5][i8]


In the former case, memory bank hits can be completely eliminated by
allocating the corresponding arrays in memory properly. In the latter case,
however, memory bank hits occur in every other iteration. Although you should
generally avoid writing such code, in this case the performance of the prolog
of the outer loop after the first iteration is limited by the .D unit, so
pairing the accesses still saves cycles overall.


To improve the performance, the last two iterations of the inner loop overlap 
part of the prolog of the outer loop. 


Example A-30. Assembly Code for the search_10i40 Index Search 


K* 


K* 


K* 


K* 


K* 


K* 


K* 


K* 


K* 


K* 


K* 


Texas Instruments, Inc ae 
K* 

Implementation of The Index Search in search_10i40 in EFR A 
Kk 

Total cycles = 400 (among the 400 cycles, 10 cycles are caused am 
by memory bank hits) me 

K* 

Register Usage: A B ae 
K* 

ules) 15 ae 


LDH 


LDH 
LDH 
MV 


KEKE K KKK KKK KKK KKK KKK KKK KKK KKK KKK KEK KKK KKK KKK KEK KKK KKK KK KKK KKK KEK KKK KEK KKK KKK KKK KKKKKEKK 


KEKKK KKK KEK KKK KKK KEK KKK KKK KKK KEK KKK KKK KEK KKK KKK KEK KKK KEKE KKK KKK KKK KEK KKK KEK KKK KKK KKK KEKKKK KK 


Dil. 


«Di. 
D2 
.S1X 


Kk 


; Al3 --—- &ipos[0] and alp 

; B6 --- &local_16[0] 

; A6é --- stack pointer, point to &local_32[0] 

; B8 --- &rr[0] [0] 

, A4 -—- &rrv[0] 

; B14 -—- &dn[0] 

; Bl --- reserved for the counter of the 

; outmost loop in search_10i40 
*+A13[8],A7 ; load i8 = ipos[8] 
*+A13[9],B13 ; load i9 = ipos[9] 
*B6,A13 ; load i0 
B6,A5 ; &local_v16[0] 
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Example A—30. Assembly Code for the search_10i40 Index Search (Continued) 


LDH 
LDH 
MVK 


UO 
OO 


= ae a 
OU OO Ga 


D2 
-D1 
od 


-L1X 


*+B6[2],B9 
*+A5[1],A14 
0,A8 


*+A5[4],A15 
*+B6[3],B10 
80,A0 
80,BO0 


A8, *+A6[1] 
*4+B6([5],B1l 
A7,1,B10 
A7,A0,A12 


A7,*+A5[8] 
B13, *+B6[9] 
B8,B10,B2 
A13,B0,B3 


*A6,B15 
*+B6[6],Al 
A12,B2,A12 
B14,B10,B7 
B8,A12,B8 
A14,A0,A14 
B9,BO,B9 


*+A6[2],Al11 
*+B6[V7],B5 
B13,B13,B12 
B3,B2,B3 


*A12,A5 
*B7++[5],B12 
B14,B12,B14 
Al4,B2,A14 


*B3++[5],A5 
B9,B2,B9 
B10,A0,A9 


*A144++[5],A5 
A4,B12,A4 
A15,A0,A15 
B11,B0,Bl11 


load i2 
load il 


could insert two .D 
units here for the store 


of rrv[i9+30] 


in the code which this piece 


and rrv[i9+35] 


immediately follows 


load i4 
load i3 


ps=0 

load i5 
[0] [18] 
[i8] [0] 


store ia=i8 
store ib=i9 
&rr[0] [i8] 
[i0] [0] 


load ps0 
load i6 
érr[i8] [i8] 


édn[i8] 
&xrxr[i8] [0] 
i1]) [0] 
12] [0] 
load alp0d 
load i7 

0) [19] 


&rr[id] [i8 


load dn[i8 
&dn[i9] 
&rr[il] [i8 


&érr[i2][i8 
13] [0] 


&érrv[i9 
14] [0] 
15] [0] 


load rr[i8] [i8] 


load rr[i0] [i8] 


load rr[il] [i8] 
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Example A-30. Assembly Code for the search_10i40 Index Search (Continued) 


        LDH     .D2     *B9++[5],A5     ; load rr[i2][i8]
||      MVK     .S1     256,A0          ; A0=_1_128
||      ADD     .L1X    A9,B2,A9        ; &rr[i3][i8]
||      MPYU    .M1     A1,A0,A1        ; [i6][0]

        LDH     .D1     *A9++[5],B12    ; load rr[i3][i8]
||      ADD     .D2     B11,B2,B10      ; &rr[i5][i8]
||      MVK     .S1     7,A2            ; outer loop counter
||      MVK     .S2     512,B0          ; B0=_1_64
||      ADD     .L1X    A15,B2,A15      ; &rr[i4][i8]
||      ADD     .L2     B8,B12,B4       ; &rr[i8][i9]
||      MPYU    .M2     B5,B0,B5        ; [i7][0]

        LDH     .D1     *A15++[5],A5    ; load rr[i4][i8]
||      SHL     .S1     A0,1,A0         ; _1_64
||      SHL     .S2     B12,16,B11      ; dn[i8] << 16
||      ADD     .L1X    A1,B2,A1        ; &rr[i6][i8]
||      SMPY    .M1     A5,A0,A8        ; smpy(rr[i8][i8],_1_128)

        LDH     .D2     *B10++[5],A5    ; load rr[i5][i8]
||      MVK     .S1     -1,A3           ; sq=-1
||      SMPY    .M1     A5,A0,A8        ; smpy(rr[i0][i8],_1_64)

        LDH     .D1     *A1++[5],B12    ; load rr[i6][i8]
||      ADD     .D2     B5,B2,B11       ; &rr[i7][i8]
||      SHL     .S1     A0,7,A13        ; alp=0x10000
||      SADD    .L1     A11,A8,A11      ; alp1=sadd(alp0,smpy(rr[i8][i8],_1_128))
||      SADD    .L2     B15,B11,B15     ; ps1
||      SMPY    .M1     A5,A0,A8        ; smpy(rr[i1][i8],_1_64)

        LDH     .D2     *B11++[5],A5    ; load rr[i7][i8]
||      SADD    .L1     A11,A8,A11      ; alp1=sadd(alp1,smpy(rr[i0][i8],_1_64))
||      SMPY    .M1     A5,A0,A8        ; smpy(rr[i2][i8],_1_64)

OUTERLOOP:
        LDH     .D1     *A4++[5],A5     ; load rrv[i9]
||      LDH     .D2     *B4++[5],B12    ; load rr[i8][i9]
||      SADD    .L1     A11,A8,A11      ; alp1=sadd(alp1,smpy(rr[i1][i8],_1_64))
||      SUB     .L2     B13,5,B13
||      SMPY    .M1X    B12,A0,A8       ; smpy(rr[i3][i8],_1_64)

        LDH     .D2     *B14++[5],B12   ; load dn[i9]
||      SADD    .L1     A11,A8,A11      ; alp1=sadd(alp1,smpy(rr[i2][i8],_1_64))
||      SMPY    .M1     A5,A0,A8        ; smpy(rr[i4][i8],_1_64)
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Example A—30. Assembly Code for the search_10i40 Index Search (Continued) 


ms Et we 


B10, *+A6[4] 
Al1,A8,A11 
A5,A0,A8 


Al, *+A6[5] 
AO, 6,A10 

Al1,A8,A11 
B12,A0,A8 


*A4++[5],A5 
*BA++[5],B12 
AO, 3,A0 
Al1,A8,A11 
A5,A0,A8 


*B14++[5],B12 


Al1,A8,A11 
A5,A0,A5 
B12,B0,B12 
B11, *+A6[6] 
B12,16,Bl 
Al1,A8,A11 
All, *+A6[3] 
Al1,A5,A5 
B11,B15,B5 


*A4++[5],A5 
*B4++[5],B12 
INNERLOOP 
A5,B12,Al 
B5,B5,B8 


*B14++[5],B12 
4,Al 

0,B2 
Al1,A10,A8 
A5,A0,A5 
B12,B0,B12 


r 


, 


’ 


7** load rrv[i9] 


, 
’ 


r 


7** load dn[i9] 


, 


, 


r 


’ 


;* load dn[i9 


store &rr[i5] [i8+5] 
alpl=sadd(alp1, smpy (rr[i3] [i8],_1_64) ) 
smpy (rr[i5] [i8],_1_64) 


store &rr[i6é] [i8t+5] 

0x8000L 
alpl=sadd(alpl, smpy (rr[i4] [i8],_1_64) ) 
smpy (rr[i6] [i8],_1_64) 


* load rry[i9] 

* load rr(is8] [19] 
A0=_1_8 
alpl=sadd(alp1l, smpy (rr[i5] [i8],_1_64) ) 
smpy (rr[i7] [i8],_1_64) 


alpl=sadd(alpl,smyp (rr[i6] [i8],_1_64) ) 
smpy (rrv[i9],_1_8) 
smpy (rr[i8] [i9],_1_64) 


store &rr[i7] [i8+5] 
dn[i9] << 16 
done alpl=sadd(alpl, smpy (rr[i7] [i8],_1_64) ) 


store alpl 
alp2=sadd(alpl,smpy (rrv[i9],_1_8) ) 
ps2=sadd(ps1,dn[i9]<<16) 


** load rr[i8] [i9] 

branch to the innerloop 
alp2=sadd(alp2,smpy (rr[i8] [i9],_1_64) ) 
sq2=smpyh (ps2,ps2) 


innerloop counter 


alp_16 = sacc(alp2, 0x8000L) 
* smpy (rrv[i9],_1_8) 
* smpy (rr[i8] [i9],_1_64) 
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Example A-—30. Assembly Code for the search_10i40 Index Search (Continued) 


INNERLOOP : 


LDW 
STH 
SHL 
ADD 
SMP YH 
SMP YH 


[B2] 


B2] 
[B2] 


[B2] MV 


[B2] MV 


[B2] 


~S1X 


-S2X 


*+A6[3],A11 
B13, *+B6[9] 
B12,16,B10 
B13,5,;B13 
A8,A3,A11 
B8,A13,B10 


B11,*+A6[1] 
A7,*+B6[8] 
B5,B11 
A11,A5,A5 
B10,B15,B5 


*A4++[5],A5 


*B4++[5],B12 


Al,1,Al 
INNERLOOP 
A5,B12,Al11 
B10,A11,B2 
B5,B5,B8 


A8,A13 


*B14++[5],B12 


B8,A3 
Al11,A10,A8 
A5,A0,A5 
B12,B0,B12 


B11, *+A6[1] 
A7,*+B6[8] 
B12,16,B10 
B5,Bl1l1 
A8,A3,A11 
B8,A13,B10 


*+A6[2],A11 
B13, *+B6[9] 
A6,B2 
Al11,A5,A5 
B10,B15,B5 


, 
, 


, 


, 


, 


load alpl 
store ib=i9 


,* dn[i9]<<16 


i19=i19+STEP 
smpyh (alp_16, sq) 
smpyh (alp, sq2) 


store ps 
store ia = i8 


; *alp2=sadd(alpl, smpy (rrv[i9],_1_8) ) 
;* ps2=sadd(psl1,dn[i9]<<16) 


7*** load rrv[i9+10] 
7*** load rr[ is] [19410] 


, 


, 


, 


decrement innerloop counter 
branch to INNERLOOP 


; *alp2=sadd (alp2, smpy (rr[i8] [i9],_1_64) ) 


if smpyh(alp,sq2) > smpyh(alp_16,sq) 
* sq2=smpyh (ps2,ps2) 


alp=alp_16 


7*** load dn[i9+10] 


, 
, 


, 


;* alp_16=sadd(alp2, 
okkK* AO 


sq=sq2 
0x8000L) 
= _1.8 


;*** BO = _1 64 


, 


end of innerloop 


store ps 

store ia = i8 
dn[i9]<<16 

ps2 

smpyh (alp_16,sq) 
smpyh (alp, sq2) 


load alpo 

store ib=i9 

stack pointer 
alp2=sadd(alp2,smpy (rr[i8] [i9],_1_64) ) 
ps2=sadd (ps1, dn[i9]<<16) 
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Example A—30. Assembly Code for the search_10i40 Index Search (Continued) 


*4+A6[5],Al 
*B2,B15 
205,A0 
A5,B12,A11l 
B10,Al1,B2 
B5,B5,B8 


*++A12[A0],A5 
*B7++[5],B12 
A8,A13 
-90,B14 

B8,A3 


*+A6[4],B10 
*B3++[5],A5 
-90,A4 
A11,A10,A8 
B13,5,B13 


*A14++[5],A5 
B13, *+B6[9] 
A8,A3,A10 
B8,A13,B10 


256,A0 
*+A6[6],B11 
*B9++[5],A5 
OUTERLOOP 


*A9++[5],B12 
A7,*+B6[8] 
B13,5,B13 
B10,A10,B0 


*A15++[5],A5 
B13, *+B6[9] 
AO,1,A0 
-35,B13 
A8,A13 
A5,A0,A 


a1.128 


_1_64 
update i9 
alp=alp_16 


&rxr[i6] [i8] 
load ps0 


alp2=sadd(alp2,smpy (rr[i8] [i9],_1_64) ) 
if smpyh(alp,sq2) > smpyh(alp_16,sq) 
sq2=smpyh (ps2,ps2) 


load rr[i8] [i8] 
load dn[i8] 
alp=alp_16 
é&dn[i9] 

sq=sq2 


&rr[i5] [i8 
load rr[i0d 
&rrv[i9] 

alp_16=sadd(alp2, 0x8000L) 


] 
] [18] 


load rr[il] [i8] 
store ib=i9 
smpyh (alp_16,sq) 
smpyh (alp, sq2) 


&rxr[i7] [i8] 
load rr[i2] [i8] 
branch to OUTERLOOP 


load rr[i3] [i8] 

store ia = i8 

update i9 

if smpyh(alp,sq2) > smpyh(alp_16,sq) 


load rr[i4] [i8] 
store ib=i9 


smpy (rr[i8] [i8],_1_128) 
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Example A-—30. Assembly Code for the search_10i40 Index Search (Continued) 


[BO] 


BO] 


NnNn PN 
iw) 
iw) 
x 


.D1B11, *+A6[1] 
D2 
-S1X 
2 
-L1 


*B10++[5],A5 
B8,A3 
B12,16,B11 
A2,1,A2 
A5,A0,A8 


*A1++[5],B12 
A7, *+B6[8] 
310,B4 
Al1,A8,A11 
B15,B11,B15 
A5,A0,A8 


B5,*+A6[1] 
*B11++[5],A5 
A7,5,A7 
Al1,A8,Al11l 
A0,BO 
A5,A0,A8 


store ps 

load rr[i5][i8 
sq=sq2 

dn[i8] << 16 
decrement 

smpy (rr[i0] [i8 
load rr[i6] [i8 
store ia = i8 
&rr[i8] [i9 
alpl=sadd(alp0, 
psl = sadd(ps0, 
smpy (rr[il] [i8 
store ps 

load rr[i7][i8 
update i8 
alpl=sadd(al 
_1_64 

smpy (rr[i2] [i8 


OUTERLOOP counter 


,_1_64) 


smpy (rr[i8] [i8],_1_128) ) 
dn [i8]<<16) 
,_1_64) 


pl, smpy (rr[i0] [i8],_1_64) ) 


,_—1_64) 
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A.2.6 Implementation of the FIR Filter, residu.c, in GSM EFR Vocoder 


Example A—31 shows the C code for the FIR filter, residu.c, in the GSM EFR 
vocoder. 


Example A-31. C Code for residu.c 


#define Word16 short
#define Word32 int

Original C code

/* m = LPC order == 10 */
#define m 10

void Residu (
    Word16 a[],   /* (i) : prediction coefficients */
    Word16 x[],   /* (i) : speech signal           */
    Word16 y[],   /* (o) : residual signal         */
    Word16 lg     /* (i) : size of filtering       */
)
{
    Word16 i, j;
    Word32 s;

    for (i = 0; i < lg; i++)
    {
        s = L_mult (x[i], a[0]);
        for (j = 1; j <= m; j++)
            s = L_mac (s, a[j], x[i - j]);
        s = L_shl (s, 3);
        y[i] = round (s);
    }
    return;
}

where L_mult(a,b)  = _smpy(a,b)
      L_mac(a,b,c) = _sadd(a,_smpy(b,c))
      L_shl(a,b)   = (b>0) ? _sshl(a,b) : a >> (-b)
      round(a)     = _sadd(a,0x8000L)>>16
and lg = 40.


A.2.6.1 Rearranging the C Code 


L_shl (s, 3) can be implemented simply as _sshl (s,3). Because array a has
dimension m + 1 = 11 and the inner loop is always executed 10 times per outer
loop iteration, you can completely unroll the inner loop to gain speed by
representing array a with registers. Because a is a short integer array, two
elements fit in one register, so it requires six registers at most for full
representation. You can assign one register to a[0] alone for the following
reasons:


- a[0] is always a constant, 4096
- _shr (0x8000L, 3) = 4096
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You can change the order of rounding and left shift to save one register. (Other- 
wise, you need another register for Ox8000L.) The C code, after complete 
inner loop unrolling, is shown in Example A-32. 
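
The reordering relies on the identity 8 * (s + 4096) = 8 * s + 0x8000, so the register that holds a[0] = 4096 can also serve as the rounding constant. The sketch below compares the two orders on sampled inputs, using portable stand-ins for the saturating intrinsics. The function names are hypothetical, and the check samples the 32-bit range rather than proving the equivalence.

```c
#include <stdint.h>

/* Clamp a 64-bit value to the signed 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Stand-in for _sadd: saturating 32-bit add. */
static int32_t emu_sadd(int32_t a, int32_t b) { return sat32((int64_t)a + b); }

/* Stand-in for _sshl(a, 3): saturating left shift by 3. */
static int32_t emu_sshl3(int32_t a) { return sat32((int64_t)a * 8); }

/* Upper 16 bits of a, sign extended (an arithmetic a >> 16). */
static int32_t high16(int32_t a)
{
    return (int32_t)(((int64_t)a - (a & 0xFFFF)) / 65536);
}

/* Original order: shift left by 3, then round with 0x8000. */
static int32_t y_shift_then_round(int32_t s)
{
    return high16(emu_sadd(emu_sshl3(s), 0x8000));
}

/* Rearranged order: add 4096 (the value of a[0]), then shift left by 3. */
static int32_t y_round_then_shift(int32_t s)
{
    return high16(emu_sshl3(emu_sadd(s, 4096)));
}

/* Counts sampled inputs on which the two orders disagree. */
static long count_mismatches(void)
{
    long mismatches = 0;
    for (int64_t s = INT32_MIN; s <= INT32_MAX; s += 65521) {
        if (y_shift_then_round((int32_t)s) != y_round_then_shift((int32_t)s))
            mismatches++;
    }
    return mismatches;
}
```

For example, both functions return 1 for s = 4096, and both saturate to 32767 for the largest positive inputs.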


Example A—32. C Code for residu.c After Rearrangement Using Intrinsics 


for (i = 0; i < lg; i++)
{
    s = _smpy(x[i], a[0]);
    s = _sadd(s,_smpy(a[1], x[i-1]));
    s = _sadd(s,_smpy(a[2], x[i-2]));
    s = _sadd(s,_smpy(a[3], x[i-3]));
    s = _sadd(s,_smpy(a[4], x[i-4]));
    s = _sadd(s,_smpy(a[5], x[i-5]));
    s = _sadd(s,_smpy(a[6], x[i-6]));
    s = _sadd(s,_smpy(a[7], x[i-7]));
    s = _sadd(s,_smpy(a[8], x[i-8]));
    s = _sadd(s,_smpy(a[9], x[i-9]));
    s = _sadd(s,_smpy(a[10], x[i-10]));
    s = _sadd(s, a[0]);
    s = _sshl(s,3);
    y[i] = _shr(s,16);
}


A.2.6.2 Performance Analysis 
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The performance is limited by the .L unit for _sadd because this unit is used 
at least 11 times per iteration. With two .L units available, one iteration
therefore takes at least six cycles. You may choose to unroll the loop once to
compute two y values per iteration for the following reasons:


- To satisfy the ordering property of _sadd


- To maximize speed: eleven cycles are required to compute two y values,
  while six cycles are needed for one y


The C code is shown in Example A-33.
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Example A-—33. Implemented C Code for residu.c 


for (i = 0; i < lg; i+=2)
{
    s0 = _smpy(x[i], a[0]);
    s1 = _smpy(x[i+1], a[0]);
    s0 = _sadd(s0,_smpy(a[1], x[i-1]));
    s1 = _sadd(s1,_smpy(a[1], x[i]));
    s0 = _sadd(s0,_smpy(a[2], x[i-2]));
    s1 = _sadd(s1,_smpy(a[2], x[i-1]));
    s0 = _sadd(s0,_smpy(a[3], x[i-3]));
    s1 = _sadd(s1,_smpy(a[3], x[i-2]));
    s0 = _sadd(s0,_smpy(a[4], x[i-4]));
    s1 = _sadd(s1,_smpy(a[4], x[i-3]));
    s0 = _sadd(s0,_smpy(a[5], x[i-5]));
    s1 = _sadd(s1,_smpy(a[5], x[i-4]));
    s0 = _sadd(s0,_smpy(a[6], x[i-6]));
    s1 = _sadd(s1,_smpy(a[6], x[i-5]));
    s0 = _sadd(s0,_smpy(a[7], x[i-7]));
    s1 = _sadd(s1,_smpy(a[7], x[i-6]));
    s0 = _sadd(s0,_smpy(a[8], x[i-8]));
    s1 = _sadd(s1,_smpy(a[8], x[i-7]));
    s0 = _sadd(s0,_smpy(a[9], x[i-9]));
    s1 = _sadd(s1,_smpy(a[9], x[i-8]));
    s0 = _sadd(s0,_smpy(a[10], x[i-10]));
    s1 = _sadd(s1,_smpy(a[10], x[i-9]));
    s0 = _sadd(s0, a[0]);
    s1 = _sadd(s1, a[0]);
    s0 = _sshl(s0,3);
    s1 = _sshl(s1,3);
    y[i] = _shr(s0,16);
    y[i+1] = _shr(s1,16);
}


A.2.6.3 Final Assembly Code for residu.c 


The final assembly code is shown in Example A-34. 
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Example A-34. Assembly Code for residu.c 


Kk 


kK* 


Kk 


K* 


Kk 


K* 


Kk 


K* 


Kk 


K* 


Implementation 


Compute two ys 


Total cycles = 


Register Usage: 


SMPY 
SMP YHL 
LDW 
SADD 
SADD 


SMP YHL 
SMPY 
SADD 
SADD 


of residu.c EFR 


at a time 


(1lg/2+1) *11+6 


23) (for lg 


A3,A0,A8 


A3,BO, 


B8 


*A4--,Al 
A8,A9,A9 


B8,B9, 


BQ 


Al1,B4,A8 


A3,B4, 


B8 


A8,A9,A9 


B8,B9, 


BQ 


, 
, 
, 


, 


A4 --- 


B 
10 


&a[0] 
&x[0] 
&y [0] 
1g 


oO 
ia) 
a. 
io) 
fo) 
ll 
od 
fo) 
wo 
ron) 


load a[9] & a[10] 

to take care of the first execution 
a[0] = 4096 

loop counter, L_SUBFR/2 


smpy (x[0],a[0]) 

smpy (x[1],a[0]) 

load x[-6] & x[-5] 

sO = sadd(s0, smpy(x[-9],a[9])) 


smpy (x[-1],a[1]}) 
smpy (x[0],a[1]) 
sO = sadd(s0O, smpy(x[-10],a[10])) 
sl = sadd(sl, smpy(x[-9],a[10])) 


KKEKK KK KKK KKK KKK KEK KEK KKK KKK KEK KKK KEK KKK KKK KKK KEK KKK KEK KKK KKK KKK KEK KKK KKK KKK KKK KKK KKEKKKKKEK 


Kk 


K* 


K* 


K* 


K* 


K* 


K* 


K* 


Kk 


Kk 


KKK K KK KKK KK KKK KKK KK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KEK KKK KKK KKKKEKKEKKKKKKEKKKKKKK KKK K 
Example A-34. Assembly Code for residu.c (Continued)


        SMPYLH  .M1X  A1,B4,A8       ; smpy(x[-2],a[2])
        SMPYH   .M2X  A1,B4,B8       ; smpy(x[-1],a[2])
        ADD     .S1   A8,0,A9        ; s0=smpy(x[0],a[0])
        ADD     .S2   B8,0,B9        ; s1=smpy(x[1],a[0])
        LDW     .D1   *A4--,A3       ; load x[-8] & x[-7]
[!A2]   SADD    .L1   A9,A0,A9       ; s0 = sadd(s0, 4096)
[!A2]   SADD    .L2   B9,B0,B9       ; s1 = sadd(s1, 4096)
        SMPYHL  .M1X  A3,B1,A8       ; smpy(x[-3],a[3])
        SMPY    .M2X  A1,B1,B8       ; smpy(x[-2],a[3])
        SADD    .L1   A8,A9,A9       ; s0 = sadd(s0, smpy(x[-1],a[1]))
        SADD    .L2   B8,B9,B9       ; s1 = sadd(s1, smpy(x[0],a[1]))
[!A2]   SSHL    .S1   A9,3,A7        ; s0 = L_shl(s0,3)
[!A2]   SSHL    .S2   B9,3,B10       ; s1 = L_shl(s1,3)
        SMPYLH  .M1X  A3,B1,A8       ; smpy(x[-4],a[4])
        SMPYH   .M2X  A3,B1,B8       ; smpy(x[-3],a[4])
        SADD    .L1   A8,A9,A9       ; s0 = sadd(s0, smpy(x[-2],a[2]))
        SADD    .L2   B8,B9,B9       ; s1 = sadd(s1, smpy(x[-1],a[2]))
        LDW     .D1   *A4++[6],A1    ; load x[-10] & x[-9] and update the
                                     ; pointer to the new &x[0]
[!A2]   SHR     .S1   A7,16,A7       ; y[0] = shr(s0, 16)
[!A2]   SHR     .S2   B10,16,B10     ; y[1] = shr(s1, 16)
        SMPYHL  .M1X  A1,B5,A8       ; smpy(x[-5],a[5])
        SMPY    .M2X  A3,B5,B8       ; smpy(x[-4],a[5])
        SADD    .L1   A8,A9,A9       ; s0 = sadd(s0, smpy(x[-3],a[3]))
        SADD    .L2   B8,B9,B9       ; s1 = sadd(s1, smpy(x[-2],a[3]))
[!A2]   STH     .D1   A7,*A6++       ; store y[0]
[B2]    SUB     .S2   B2,2,B2        ; decrement loop counter
[B2]    B       .S1   LOOP           ; branch to the loop
        SMPYLH  .M1X  A1,B5,A8       ; smpy(x[-6],a[6])
        SMPYH   .M2X  A1,B5,B8       ; smpy(x[-5],a[6])
        SADD    .L1   A8,A9,A9       ; s0 = sadd(s0, smpy(x[-4],a[4]))
        SADD    .L2   B8,B9,B9       ; s1 = sadd(s1, smpy(x[-3],a[4]))
        LDW     .D1   *A4--,A3       ;* load x[0] & x[1] for the next iteration
        SMPYHL  .M1X  A3,B6,A8       ; smpy(x[-7],a[7])
        SMPY    .M2X  A1,B6,B8       ; smpy(x[-6],a[7])
        SADD    .L1   A8,A9,A9       ; s0 = sadd(s0, smpy(x[-5],a[5]))
        SADD    .L2   B8,B9,B9       ; s1 = sadd(s1, smpy(x[-4],a[5]))
        LDW     .D1   *A4--,A1       ;* load x[-1] & x[-2]
        SMPYLH  .M1X  A3,B6,A8       ; smpy(x[-8],a[8])
        SMPYH   .M2X  A3,B6,B8       ; smpy(x[-7],a[8])
        SADD    .L1   A8,A9,A9       ; s0 = sadd(s0, smpy(x[-6],a[6]))
        SADD    .L2   B8,B9,B9       ; s1 = sadd(s1, smpy(x[-5],a[6]))
[!A2]   STH     .D1   B10,*A6++      ; store y[1]
        SMPYHL  .M1X  A1,B7,A8       ; smpy(x[-9],a[9])
        SMPY    .M2X  A3,B7,B8       ; smpy(x[-8],a[9])
        SADD    .L1   A8,A9,A9       ; s0 = sadd(s0, smpy(x[-7],a[7]))
        SADD    .L2   B8,B9,B9       ; s1 = sadd(s1, smpy(x[-6],a[7]))
[A2]    SUB     .S1   A2,1,A2
        LDW     .D1   *A4--,A3       ;* load x[-3] & x[-4]
        SMPYLH  .M1X  A1,B7,A8       ; smpy(x[-10],a[10])
        SMPYH   .M2X  A1,B7,B8       ; smpy(x[-9],a[10])
        SADD    .L1   A8,A9,A9       ; s0 = sadd(s0, smpy(x[-8],a[8]))
        SADD    .L2   B8,B9,B9       ; s1 = sadd(s1, smpy(x[-7],a[8]))


There is no memory bank hit within the loop. To avoid a memory bank hit within
the prolog of the loop, arrays a and x must be allocated so that a[1] and x[0] are
offset from each other by one word. Some of the instructions in the loop cannot
be executed in the first iteration. Register A2 indicates which instructions these
are.


A.2.7 Implementation of the Lag Search in the lag_max() Routine


The lag_max() routine performs an open-loop pitch (or lag) search and
computes the normalized correlation for the selected lag. This section
illustrates the implementation of the lag search. The lag search C code is
shown in Example A-35.


Example A-35. C Code for the Lag Search in lag_max()


#define Word16 short
#define Word32 int

#define MIN_32 0x80000000L
#define PIT_MAX 143
#define L_FRAME 160

input:
    Word16 scal_sig[PIT_MAX+L_FRAME];   (pointed at scal_sig[PIT_MAX] when passed)
    Word16 scal_fac;                    (not used in this part of the code)
    Word16 L_frame, lag_min, lag_max;
local variables:
    Word16 i, j, *p, *p1, p_max;
    Word32 t0, max;
return:
    Word16 p_max;

Original C code

    max = MIN_32;
    for (i = lag_max; i >= lag_min; i--)
    {
        p = scal_sig;
        p1 = &scal_sig[-i];
        t0 = 0;
        for (j = 0; j < L_frame; j++, p++, p1++)
        {
            t0 = L_mac(t0, *p, *p1);
        }
        if (L_sub(t0, max) >= 0)
        {
            max = t0;
            p_max = i;
        }
    }

where L_mac(a,b,c) = _sadd(a,_smpy(b,c))
      L_sub(a,b) = _ssub(a,b)
      L_frame = L_FRAME/2 = 80
and the search range (lag_min, lag_max) is (18,35), (36,71), or (72,143).
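For experimenting with the lag search away from the 'C6x tools, the listing above can be mirrored in portable C. The following is a minimal sketch, not the vocoder's own source: sat_add() and sat_mpy() are hypothetical stand-ins that assume _sadd performs a 32-bit saturating add and _smpy a 16x16 multiply, shifted left by one bit and saturated, and the comparison t0 >= max is assumed to be equivalent to L_sub(t0, max) >= 0.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed behavior of _sadd: 32-bit add with saturation. */
static int32_t sat_add(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* Assumed behavior of _smpy: (a * b) << 1, saturated to 32 bits. */
static int32_t sat_mpy(int16_t a, int16_t b)
{
    int64_t p = (int64_t)a * b * 2;
    return (p > INT32_MAX) ? INT32_MAX : (int32_t)p;
}

/* sig plays the role of scal_sig as passed to lag_max(): sig[-lag_max]
   through sig[L_frame-1] must be valid elements. */
static int lag_search_ref(const int16_t *sig, int L_frame,
                          int lag_min, int lag_max)
{
    assert(lag_min <= lag_max && L_frame > 0);
    int32_t max = INT32_MIN;            /* MIN_32 */
    int p_max = lag_max;
    for (int i = lag_max; i >= lag_min; i--) {
        int32_t t0 = 0;
        for (int j = 0; j < L_frame; j++)
            t0 = sat_add(t0, sat_mpy(sig[j], sig[j - i]));
        if (t0 >= max) {                /* same decision as L_sub(t0,max) >= 0 */
            max = t0;
            p_max = i;
        }
    }
    return p_max;
}
```

A pulse train with a period of 20 samples, searched over the range (18,35), should select lag 20, because that is the only lag in the range at which the pulses line up.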
A.2.7.1 Rearranging The C Code and Unrolling The Loops


This algorithm favors smaller lag candidates, because it compares with
if (L_sub(t0,max) >= 0) and the search starts from lag_max. Because there is
no single instruction for the >= (or <=) comparison, you can reverse the
search order so that it starts from lag_min and compares with if (t0 > max);
p_max is then initialized to lag_min. The modified C code is shown in
Example A-36.


Example A-36. C Code for the Lag Search in lag_max() (Comparison Order Changed)


    max = MIN_32;
    p_max = lag_min;
    for (i = lag_min; i <= lag_max; i++)
    {
        p = scal_sig;
        p1 = &scal_sig[-i];
        t0 = 0;
        for (j = 0; j < L_frame; j++, p++, p1++)
        {
            t0 = L_mac(t0, *p, *p1);
        }
        if (t0 > max)
        {
            max = t0;
            p_max = i;
        }
    }
Next, look at the inner loop, a general MAC loop. Because *p does not always
equal *p1, it does not fall into the special case described in section A.2.1,
Implementation of the Multiply-Accumulate Loop, beginning on page A-4. Therefore,
the performance cannot be improved by simply unrolling the inner loop.
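The change of search order relies on both scans selecting the same lag when several lags share the maximum correlation. The sketch below illustrates this on an array of precomputed correlation values; the function names and the corr array are hypothetical and serve only to isolate the comparison logic from the MAC loop.

```c
#include <assert.h>
#include <stdint.h>

/* corr[k] is the correlation for lag (lag_min + k); n is the number of lags. */

/* Original order: start at the largest lag, accept ties with >=. */
static int pick_descending(const int32_t *corr, int n, int lag_min)
{
    assert(n > 0);
    int32_t max = INT32_MIN;
    int p_max = lag_min + n - 1;
    for (int k = n - 1; k >= 0; k--) {
        if (corr[k] >= max) {
            max = corr[k];
            p_max = lag_min + k;
        }
    }
    return p_max;
}

/* Rearranged order: start at the smallest lag, require a strict >. */
static int pick_ascending(const int32_t *corr, int n, int lag_min)
{
    assert(n > 0);
    int32_t max = INT32_MIN;
    int p_max = lag_min;                /* p_max initialized to lag_min */
    for (int k = 0; k < n; k++) {
        if (corr[k] > max) {
            max = corr[k];
            p_max = lag_min + k;
        }
    }
    return p_max;
}
```

In both scans a tie is resolved in favor of the smaller lag: the descending scan keeps overwriting p_max as it moves down, and the ascending scan never replaces an equal value found earlier.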


Now consider unrolling the outer loop once. The C code with outer loop 
unrolling is shown in Example A-37. Because the number of lags that needs 
to be searched within each search range is always even, such unrolling does 
not create an additional case to handle. 


Example A-37. C Code for the Lag Search in lag_max() With Outer Loop Unrolling


    /* L_mac is written with its intrinsics substitutes. */
    Word32 t1;

    max = MIN_32;
    p_max = lag_min;
    for (i = lag_min; i < lag_max; i+=2)
    {
        p = scal_sig;
        p1 = &scal_sig[-i];
        t0 = 0;
        t1 = 0;
        for (j=0; j<L_frame; j++, p++, p1++)
        {
            t1=_sadd(t1,_smpy(*p,p1[-1]));  /* or t1=_sadd(t1,_smpy(scal_sig[j],scal_sig[-i-1+j])) */
            t0=_sadd(t0,_smpy(*p,*p1));     /* or t0=_sadd(t0,_smpy(scal_sig[j],scal_sig[-i+j])) */
        }
        if (t0 > max)
        {
            max = t0;
            p_max = i;
        }
        if (t1 > max)
        {
            max = t1;
            p_max = i+1;
        }
    }


The smaller lag is always compared first in the order of the comparisons.


The instructions required for one iteration of the inner loop are shown in
Example A-38.


Example A-38. Linear Assembly for the Lag Search in lag_max() Inner Loop


INNERLOOP:
        LDH   .D  *p++,sigj            ; load scal_sig[j]
        LDH   .D  *-p1[1],scalij1      ; load scal_sig[-i-1+j]
        SMPY  .M  sigj,scalij1,tmp1    ; smpy(scal_sig[j],scal_sig[-i-1+j])
        SADD  .L  t1,tmp1,t1           ; t1=sadd(t1,smpy(scal_sig[j],scal_sig[-i-1+j]))
        LDH   .D  *p1++,scalij         ; load scal_sig[-i+j]
        SMPY  .M  sigj,scalij,tmp0     ; smpy(scal_sig[j],scal_sig[-i+j])
        SADD  .L  t0,tmp0,t0           ; t0=sadd(t0,smpy(scal_sig[j],scal_sig[-i+j]))
[icntr] SUB   .S  icntr,1,icntr        ; decrement inner loop counter
[icntr] B     .S  INNERLOOP            ; branch to inner loop
The .D unit is used the most (three times). With only two .D units available,
the three loads need two cycles. Therefore, the inner loop takes two cycles.
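The cycle count above follows from a resource bound: a unit type that is used n times per iteration needs at least ceil(n/2) cycles when two units of that type exist. A minimal sketch of that arithmetic follows; min_cycles() is a hypothetical helper, it assumes two units of each type (.D, .M, .L, .S), and it deliberately ignores dependency bounds such as loop-carried paths.

```c
#include <assert.h>

enum { UNITS_PER_TYPE = 2 };   /* assumed: one unit of each type per side */

/* Resource-only lower bound on the cycles needed for one loop iteration,
   given how many instructions per iteration use each unit type. */
static int min_cycles(int d_uses, int m_uses, int l_uses, int s_uses)
{
    int uses[4] = { d_uses, m_uses, l_uses, s_uses };
    int bound = 1;
    for (int k = 0; k < 4; k++) {
        assert(uses[k] >= 0);
        int need = (uses[k] + UNITS_PER_TYPE - 1) / UNITS_PER_TYPE;  /* ceil */
        if (need > bound)
            bound = need;
    }
    return bound;
}
```

For the inner loop of Example A-38 (three LDH, two SMPY, two SADD, and a SUB plus a branch), min_cycles(3, 2, 2, 2) evaluates to 2.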


Now unroll the inner loop once. The first iteration of t1 and the last iteration of
t0 are performed outside the inner loop. This avoids memory bank hits. The C code
with the inner and outer loops unrolled is shown in Example A-39.


Example A-39. C Code for the Lag Search in lag_max() With Inner and Outer Loops Unrolled


    Word32 t1;

    max = MIN_32;
    p_max = lag_min;
    for (i = lag_min; i < lag_max; i+=2)
    {
        p = scal_sig;
        p1 = &scal_sig[-i];
        t0 = 0;
        t1 = 0;
        t1=_sadd(t1,_smpy(*p,p1[-1]));        /* or t1=_sadd(t1,_smpy(scal_sig[0],scal_sig[-i-1])) */
        for (j=0; j<(L_frame-1); j+=2, p+=2, p1+=2) {
            t0=_sadd(t0,_smpy(*p,*p1));       /* or t0=_sadd(t0,_smpy(scal_sig[j],scal_sig[-i+j])) */
            t1=_sadd(t1,_smpy(p[1],*p1));     /* or t1=_sadd(t1,_smpy(scal_sig[j+1],scal_sig[-i+j])) */
            t0=_sadd(t0,_smpy(p[1],p1[1]));   /* or t0=_sadd(t0,_smpy(scal_sig[j+1],scal_sig[-i+j+1])) */
            t1=_sadd(t1,_smpy(p[2],p1[1]));   /* or t1=_sadd(t1,_smpy(scal_sig[j+2],scal_sig[-i+j+1])) */
        }
        t0=_sadd(t0,_smpy(scal_sig[L_frame-1],scal_sig[-i+L_frame-1]));
        if (t0 > max) {
            max = t0;
            p_max = i;
        }
        if (t1 > max) {
            max = t1;
            p_max = i+1;
        }
    }


Although five values of scal_sig (scal_sig[j], scal_sig[j+1], scal_sig[j+2],
scal_sig[-i+j], and scal_sig[-i+j+1]) are required for each inner loop
iteration, scal_sig[j] does not need to be loaded, because it was loaded in the
previous iteration. This means only four loads are required per iteration.
Example A-40 gives the instructions for the modified inner loop.
Example A-40. Linear Assembly for the Lag Search in lag_max() Inner Loop


INNERLOOP:
        LDH   .D  *p++,sigj              ; load scal_sig[j]
        LDH   .D  *-p1[1],scalij1        ; load scal_sig[-i-1+j]
        SMPY  .M  sigj,scalij1,t1        ; t1=smpy(scal_sig[j],scal_sig[-i-1+j])
        LDH   .D  *p1++,scalij           ; load scal_sig[-i+j]
        SMPY  .M  sigj,scalij,tmp0       ; smpy(scal_sig[j],scal_sig[-i+j])
        SADD  .L  t0,tmp0,t0             ; t0=sadd(t0,smpy(scal_sig[j],scal_sig[-i+j]))
        LDH   .D  *p++,sigj+1            ; load scal_sig[j+1]
        SMPY  .M  sigj+1,scalij,tmp1     ; smpy(scal_sig[j+1],scal_sig[-i+j])
        SADD  .L  t1,tmp1,t1             ; t1=sadd(t1,smpy(scal_sig[j+1],scal_sig[-i+j]))
        LDH   .D  *p1++,scalij+1         ; load scal_sig[-i+j+1]
        SMPY  .M  sigj+1,scalij+1,tmp0   ; smpy(scal_sig[j+1],scal_sig[-i+j+1])
        SADD  .L  t0,tmp0,t0             ; t0=sadd(t0,smpy(scal_sig[j+1],scal_sig[-i+j+1]))
        LDH   .D  *p++,sigj+2            ; load scal_sig[j+2], the scal_sig[j] for the
                                         ; next iteration
        SMPY  .M  sigj+2,scalij+1,tmp1   ; smpy(scal_sig[j+2],scal_sig[-i+j+1])
        SADD  .L  t1,tmp1,t1             ; t1=sadd(t1,smpy(scal_sig[j+2],scal_sig[-i+j+1]))
[icntr] SUB   .S  icntr,2,icntr          ; decrement inner loop counter
[icntr] B     .S  INNERLOOP              ; branch to inner loop


The unrolled inner loop still uses two cycles, but it now performs four
multiply-accumulates per iteration instead of two. Unrolling both the outer
loop and the inner loop therefore doubles the performance, provided that no
memory bank hits occur.
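The two-lag, two-sample unrolling can be checked against a plain one-lag-at-a-time search in portable C. The sketch below makes several assumptions: sat_add() and sat_mpy() are hypothetical stand-ins for _sadd and _smpy, L_frame and the number of lags are even, and the peeling (the first t1 product before the inner loop, the last three products after it) is one arrangement chosen so that each sum covers exactly L_frame products. It is not presented as the manual's exact schedule.

```c
#include <assert.h>
#include <stdint.h>

static int32_t sat_add(int32_t a, int32_t b)       /* assumed _sadd behavior */
{
    int64_t s = (int64_t)a + b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

static int32_t sat_mpy(int16_t a, int16_t b)       /* assumed _smpy behavior */
{
    int64_t p = (int64_t)a * b * 2;
    return (p > INT32_MAX) ? INT32_MAX : (int32_t)p;
}

/* One lag per outer iteration, ascending order, strict comparison. */
static int lag_search_plain(const int16_t *sig, int L_frame,
                            int lag_min, int lag_max)
{
    int32_t max = INT32_MIN;
    int p_max = lag_min;
    for (int i = lag_min; i <= lag_max; i++) {
        int32_t t0 = 0;
        for (int j = 0; j < L_frame; j++)
            t0 = sat_add(t0, sat_mpy(sig[j], sig[j - i]));
        if (t0 > max) { max = t0; p_max = i; }
    }
    return p_max;
}

/* Two lags (i and i+1) per outer iteration, two samples per inner iteration. */
static int lag_search_2x2(const int16_t *sig, int L_frame,
                          int lag_min, int lag_max)
{
    assert(L_frame >= 2 && L_frame % 2 == 0);
    assert((lag_max - lag_min + 1) % 2 == 0);
    int32_t max = INT32_MIN;
    int p_max = lag_min;
    for (int i = lag_min; i < lag_max; i += 2) {
        const int16_t *p = sig;                    /* &scal_sig[j]    */
        const int16_t *p1 = sig - i;               /* &scal_sig[-i+j] */
        int32_t t0 = 0;
        int32_t t1 = sat_mpy(p[0], p1[-1]);        /* first t1 product, j = 0 */
        for (int j = 0; j < L_frame - 2; j += 2, p += 2, p1 += 2) {
            t0 = sat_add(t0, sat_mpy(p[0], p1[0]));
            t1 = sat_add(t1, sat_mpy(p[1], p1[0]));
            t0 = sat_add(t0, sat_mpy(p[1], p1[1]));
            t1 = sat_add(t1, sat_mpy(p[2], p1[1]));
        }
        /* Peeled tail: p now points at scal_sig[L_frame-2]. */
        t0 = sat_add(t0, sat_mpy(p[0], p1[0]));
        t1 = sat_add(t1, sat_mpy(p[1], p1[0]));
        t0 = sat_add(t0, sat_mpy(p[1], p1[1]));
        if (t0 > max) { max = t0; p_max = i; }      /* smaller lag first */
        if (t1 > max) { max = t1; p_max = i + 1; }
    }
    return p_max;
}
```

Because both functions accumulate their products in the same order and compare lags in the same order, they return the same lag for any input, including inputs that saturate.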


A.2.7.2 Avoiding Memory Bank Hits


Load scal_sig[-i+j] and scal_sig[j+1] together and scal_sig[-i+j+1] and
scal_sig[j+2] together to avoid memory bank hits. Memory bank hits can also
be avoided by loading scal_sig[-i+j] and scal_sig[-i+j+1] together and
scal_sig[j+1] and scal_sig[j+2] together.


A.2.7.3 Final Assembly Code for Lag Search


The final assembly code for the lag search segment is shown in
Example A-41.
Example A-41. Assembly Code for the Lag Search in lag_max()


*********************************************************************
**                                                                 **
**  Implementation of residu.c EFR                                 **
**                                                                 **
**  Compare two lags at a time                                     **
**                                                                 **
**  Total cycles = 7+(L_frame+6)*(lag_max-lag_min+1)/2             **
**                                                                 **
**  Register Usage:    A     B                                     **
**                     10    9                                     **
**                                                                 **
*********************************************************************
; A4 --- &scal_sig
; A6 --- lag_min
; B6 --- lag_max
        SUBAH   .D1   A4,A6,A7       ; p1=&scal_sig[-LAG_MIN]
        MVK     .S2   1,B2
        SUB     .L1X  B6,A6,A1       ; the outer loop counter
        MV      .L2X  A4,B7          ; p=&scal_sig[0]
        MPY     .M2   B0,0,B0        ; initialize the comparison result
        MPY     .M1   A2,0,A2        ; take care of the initial iteration
        MV      .S1   A6,A4          ; p_max = lag_min
        SHL     .S2   B2,31,B2       ; max=MIN_32=0x80000000L
        LDH     .D1   *-A7[1],A5     ; scal_sig[-LAG_MIN-1]
        LDH     .D2   *B7,B5         ; scal_sig[0]
        ADD     .L1   A1,1,A1        ; make the counter an even number
OUTERLOOP:
        LDH     .D1   *A7,A5         ; scal_sig[-LAG_MIN]
        LDH     .D2   *+B7[1],B6     ; scal_sig[1]
[A2]    SADD    .L2   B10,B8,B10
[A1]    MVK     .S2   37,B1          ; inner loop counter
        MPY     .M1   A3,0,A3
        MPY     .M2   B8,0,B8
        ADD     .S1   A7,2,A9        ; &scal_sig[-LAG_MIN+1]
        SUB     .L1   A7,4,A7        ; update p1 = &scal_sig[-LAG_MIN-2]
        LDH     .D1   *A9++,A5       ; scal_sig[-LAG_MIN+1]
        LDH     .D2   *+B7[2],B5     ; scal_sig[2]
[B1]    B       .S1   INNERLOOP      ; branch to the inner loop
[A2]    CMPGT   .L2   B10,B2,B0      ; if (t0>max)
        LDH     .D1   *A9++,A5       ; scal_sig[-LAG_MIN+2]
        LDH     .D2   *+B7[3],B6     ; scal_sig[3]
[B0]    MV      .L2   B10,B2         ; max = t0
        MPY     .M1X  B1,1,A2        ; counter to branch to the outer loop
INNERLOOP:
        LDH     .D1   *A9++,A5       ; scal_sig[-LAG_MIN+3]
        LDH     .D2   *+B7[4],B5     ; scal_sig[4]
[B1]    B       .S2   INNERLOOP      ; branch to the inner loop
[A2]    CMPGT   .L2X  A0,B2,B0       ; if (t1>max)
[B0]    SUB     .L1   A6,2,A4        ; p_max = i
        ADD     .S1   A6,2,A6        ; update i
        MPY     .M1   A0,0,A0        ; initialize t1=0
        MPY     .M2   B10,0,B10      ; initialize t0=0

        LDH     .D1   *A9++,A5       ; scal_sig[-LAG_MIN+4]
        LDH     .D2   *+B7[5],B6     ; scal_sig[5]
        SMPY    .M1X  A5,B5,A3       ; _smpy(scal_sig[-LAG_MIN-1],scal_sig[0])
[B0]    MV      .L2X  A0,B2          ; max = t1
[B0]    SUB           A6,3,A4        ; p_max = i+1
        SUB           A1,2,A1        ; update inner loop counter
        ADD           B7,12,B9       ; &scal_sig[6]

        LDH     .D1   *A9++,A5       ; scal_sig[-LAG_MIN+5]
        LDH     .D2   *B9++,B5       ; scal_sig[6]
        SMPY    .M1X  A5,B6,A3       ; _smpy(scal_sig[-LAG_MIN],scal_sig[1])
        SMPY    .M2X  A5,B5,B8       ; _smpy(scal_sig[-LAG_MIN],scal_sig[0])
        SADD    .L1   A0,A3,A0       ; update t1
        SADD    .L2   B10,B8,B10     ; update t0
[B1]    B             INNERLOOP      ; branch to inner loop
[B1]    SUB           B1,1,B1        ; decrement inner loop counter

        LDH     .D1   *A9++,A5       ; scal_sig[-LAG_MIN+6]
        LDH     .D2   *B9++,B6       ; scal_sig[7]
        SMPY    .M1X  A5,B5,A3       ; _smpy(scal_sig[-LAG_MIN+1],scal_sig[2])
        SMPY    .M2X  A5,B6,B8       ; _smpy(scal_sig[-LAG_MIN+1],scal_sig[1])
        SADD    .L1   A0,A3,A0       ; update t1
        SADD    .L2   B10,B8,B10     ; update t0
[A2]    SUB           A2,1,A2        ; decrement the counter to branch to the outer loop
[!A2]   B             OUTERLOOP      ; branch to the outer loop

        LDH     .D1   *-A7[1],A5     ; scal_sig[-LAG_MIN-3]
        LDH     .D2   *B7,B5         ; scal_sig[0]
        SADD    .L1   A0,A3,A0       ; update t1
        SADD    .L2   B10,B8,B10     ; update t0
[!A1]   B             FINISH         ; lag search is complete
FINISH:
        NOP


All the epilogs and prologs of the outer and inner loops are compressed to
minimize the code size. A2 is both the indicator for avoiding comparisons during
the initial iteration of the outer loop and the counter for branching to the outer
loop during inner loop executions.
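The cycle estimate quoted in the header of Example A-41, 7+(L_frame+6)*(lag_max-lag_min+1)/2, can be evaluated for the three search ranges with a few lines of C. This is a minimal sketch: lag_search_cycles() is a hypothetical helper, and it assumes an even number of lags because two lags are compared per outer loop iteration.

```c
#include <assert.h>

/* Evaluates 7 + (L_frame + 6) * (lag_max - lag_min + 1) / 2. */
static long lag_search_cycles(int L_frame, int lag_min, int lag_max)
{
    int lags = lag_max - lag_min + 1;
    assert(L_frame > 0);
    assert(lags > 0 && lags % 2 == 0);   /* two lags per outer iteration */
    return 7L + (long)(L_frame + 6) * lags / 2;
}
```

With L_frame = 80, the formula gives 781 cycles for the range (18,35), 1555 cycles for (36,71), and 3103 cycles for (72,143). Each pair of lags costs L_frame + 6 = 86 cycles for 160 multiply-accumulates.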


